## Profitable App Profiles for the App Store and Google Play Markets
The goal of this project is to analyze data to help our developers understand which types of apps are likely to attract more users. We use two data sets:

- A data set containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).



In [82]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [83]:
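from csv import reader

# Read each CSV file into a list of rows; row 0 is the header.
# encoding='utf8' avoids decode errors on non-English app names.
opened_file_apple = open('AppleStore.csv', encoding='utf8')
read_file_apple = reader(opened_file_apple)
apps_data_apple = list(read_file_apple)
opened_file_apple.close()

opened_file_google = open('googleplaystore.csv', encoding='utf8')
read_file_google = reader(opened_file_google)
apps_data_google = list(read_file_google)
opened_file_google.close()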

In [84]:
explore_data(apps_data_apple, 0, 2, True)
explore_data(apps_data_google, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [85]:
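# Row 10473 is missing its 'Category' value, so its columns are shifted.
print(apps_data_google[10473])
print(len(apps_data_google[10473]))  # the header has 13 columns
del apps_data_google[10473]  # run this cell only once

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12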


In [86]:
print(apps_data_google[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


### Finding duplicate entries

Some apps have more than one entry in the Google Play data set (for instance, Instagram has 4). In total, there are 1,181 duplicate entries.
We don't want duplicates in our analysis, so we need to remove the redundant entries.

In [87]:
duplicate_apps = []
unique_apps = []

# Skip the header row so it isn't counted as an app.
for app in apps_data_google[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of unique apps: ', len(unique_apps))
print('Number of duplicated apps: ', len(duplicate_apps))

for app in apps_data_google:
    if app[0] == 'Instagram':
        print(app)


Number of unique apps:  9659
Number of duplicated apps:  1181
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


### Remove duplicate entries
If you examine the rows we printed above for the Instagram app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. We won't remove rows randomly; instead, we'll keep the row with the highest number of reviews, because the higher the number of reviews, the more reliable the rating.

To do that, we will:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)
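
As a toy illustration of the two steps above, here is a minimal sketch on a handful of made-up rows. The app names, review counts, and the helper name `dedupe_by_max_reviews` are invented for this example; only the column positions (name at index 0, reviews at index 3) follow the Google Play layout.

```python
# Toy rows in the Google Play layout: name at index 0, review count at index 3.
# The app names and counts are invented for illustration.
rows = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App A', 'SOCIAL', '4.5', '250'],
    ['App B', 'TOOLS', '4.0', '30'],
    ['App A', 'SOCIAL', '4.5', '250'],
]

def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    # Step 1: find the highest review count for each name.
    reviews_max = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # Step 2: keep the first row that matches that count.
    clean = []
    already_added = set()
    for row in dataset:
        name = row[name_index]
        if float(row[reviews_index]) == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean

deduped = dedupe_by_max_reviews(rows)
print(deduped)
```

The `already_added` check matters when two rows for the same app share the same highest review count: without it, both rows would be kept.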


In [88]:
reviews_max = {}

for app in apps_data_google[1:]:
    name = app[0]
    n_reviews = float(app[3])
    # Keep the largest review count seen so far for each app name.
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

9659


In [89]:
android_clean = []
already_added = []

for app in apps_data_google[1:]:
    name = app[0]
    n_reviews = float(app[3])
    # The already_added check covers duplicates that share the same highest review count.
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
print(len(android_clean))

9659


### Removing non-English apps

If you explore the data sets enough, you'll notice the names of some of the apps suggest they are not directed toward an English-speaking audience. Below, we see a couple of examples from both data sets:

In [90]:
print(apps_data_apple[813][1])
print(apps_data_apple[6731][1])

print(android_clean[4412][0])
print(android_clean[7940][0])

BATTLE BEARS -1
Beast Poker
中国語 AQリスニング
لعبة تقدر تربح DZ


We're not interested in keeping these kinds of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

In [91]:
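def is_english(name):
    # Count characters outside the ASCII range (code points above 127).
    counter = 0
    for char in name:
        if ord(char) > 127:
            counter += 1

    # Allow up to three non-ASCII characters (such as ™ or an emoji).
    return counter <= 3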

print(is_english(apps_data_apple[813][1]))
print(is_english(android_clean[4412][0]))
print(is_english('Docs To Go™ Free Office Suite'))

True
False
True
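
To see why the threshold is applied to a count rather than to any single character, it can help to look at how many non-ASCII characters a name actually contains. The sketch below uses a hypothetical helper, `count_non_ascii`, which is not part of the notebook; the sample names are taken from the cells above.

```python
def count_non_ascii(name):
    """Return how many characters in `name` have a code point above 127."""
    return sum(1 for char in name if ord(char) > 127)

examples = ['Beast Poker', 'Docs To Go™ Free Office Suite', '中国語 AQリスニング']
for name in examples:
    print(name, '->', count_non_ascii(name))
```

A name with a single ™ symbol stays under the limit of three, while a name written mostly in another script exceeds it.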


Below, we use the `is_english()` function to filter out the non-English apps for both data sets:

In [92]:
ios_english = []
androind_english = []

for app in apps_data_apple[1:]:
    if is_english(app[1]):
        ios_english.append(app)
        
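# Note: android_clean was built without the header row, so this [1:] slice
# skips its first app rather than a header.
for app in android_clean[1:]:
    if is_english(app[0]):
        androind_english.append(app)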
        
print(len(ios_english))
print(len(androind_english))
        
        

6183
9613


### Isolating the free apps


In [93]:
def is_free(price):
    # Prices are stored as strings: '0.0' in the App Store data, '0' in the Google Play data.
    return price in ('0', '0.0')

In [94]:
ios_free = []
android_free = []

for app in ios_english:
    if app[4] == '0.0':
        ios_free.append(app)
        
for app in androind_english:
    if app[7] == '0':
        android_free.append(app)       
        
print(len(ios_free))
print(len(android_free))




3222
8863


### Most common apps by genre

Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

In [95]:
print(apps_data_apple[0])
print(apps_data_google[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [96]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [97]:
def freq_table(dataset, index):
    # Count how often each value in column `index` occurs.
    table = {}
    for app in dataset:
        value = app[index]
        if value not in table:
            table[value] = 1
        else:
            table[value] += 1
    return table

In [98]:
print('iOS by prime_genre')
display_table(ios_free, 11)
print('\n')
print('Android by Genres')
display_table(android_free, 9)
print('\n')
print('Android by Category')
display_table(android_free, 1)

iOS by prime_genre
Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4


Android by Genres
Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 73
Weather : 71
Events : 63
Adventure : 6

The frequency tables above show that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps. Now, we'd like to get an idea of the kinds of apps that have the most users.
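
Raw counts are hard to compare across two data sets of different sizes, so one option is to express each frequency as a percentage. The following is a minimal sketch, not the notebook's own method; `freq_table_percent` is a hypothetical helper and the rows are invented.

```python
def freq_table_percent(dataset, index):
    """Return {value: percentage of rows with that value in column `index`}."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(dataset)  # assumes a non-empty data set
    return {value: count / total * 100 for value, count in counts.items()}

# Invented rows; only the genre column (index 0 here) matters.
toy = [['Games'], ['Games'], ['Education'], ['Games']]
percentages = freq_table_percent(toy, 0)
for genre, pct in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', round(pct, 2))
```

Applied to the counts printed above, Games account for 1874 of the 3222 free iOS apps, or roughly 58%.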

### Most popular apps by genre on the App Store

In [99]:
freq_table_genre = freq_table(ios_free, 11)

for genre in freq_table_genre:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    
    avg_no_ratings = total / len_genre
    print('Genre: ', genre, ', average number of ratings: ', avg_no_ratings)

Genre:  Social Networking , average number of ratings:  71548.34905660378
Genre:  Photo & Video , average number of ratings:  28441.54375
Genre:  Games , average number of ratings:  22788.6696905016
Genre:  Music , average number of ratings:  57326.530303030304
Genre:  Reference , average number of ratings:  74942.11111111111
Genre:  Health & Fitness , average number of ratings:  23298.015384615384
Genre:  Weather , average number of ratings:  52279.892857142855
Genre:  Utilities , average number of ratings:  18684.456790123455
Genre:  Travel , average number of ratings:  28243.8
Genre:  Shopping , average number of ratings:  26919.690476190477
Genre:  News , average number of ratings:  21248.023255813954
Genre:  Navigation , average number of ratings:  86090.33333333333
Genre:  Lifestyle , average number of ratings:  16485.764705882353
Genre:  Entertainment , average number of ratings:  14029.830708661417
Genre:  Food & Drink , average number of ratings:  33333.92307692308
Genre:  Spo

On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings.

The same pattern applies to social networking apps, where the average is heavily influenced by a few giants like Facebook, Pinterest, and Skype. The same applies to music apps, where a few big players like Pandora, Spotify, and Shazam dominate the average.

Reference apps have 74,942 user ratings on average, but the Bible and Dictionary.com skew this average upward.

However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, and finance. The book genre seems to overlap a bit with the app idea described above, but the other genres don't seem too interesting to us:

- Weather apps: people generally don't spend much time in-app, and the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

- Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.

- Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.
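
The skew discussed in this section can be checked by listing the most-rated apps within a genre. The sketch below assumes the App Store column layout (name at index 1, rating count at index 5, genre at index 11); `top_apps_in_genre` and `make_row` are hypothetical helpers, and the apps and counts are invented.

```python
def top_apps_in_genre(dataset, genre, name_index=1, genre_index=11, count_index=5, n=5):
    """Return the n (name, rating_count) pairs with the most ratings in `genre`."""
    apps = [
        (row[name_index], float(row[count_index]))
        for row in dataset
        if row[genre_index] == genre
    ]
    return sorted(apps, key=lambda pair: pair[1], reverse=True)[:n]

def make_row(name, genre, rating_count):
    # Build a 16-column row in the App Store layout with only the needed fields filled.
    row = [''] * 16
    row[1], row[5], row[11] = name, str(rating_count), genre
    return row

# Invented apps and counts, for illustration only.
toy = [
    make_row('Big Maps', 'Navigation', 300000),
    make_row('Tiny Compass', 'Navigation', 50),
    make_row('Word Book', 'Reference', 1200),
]
for name, count in top_apps_in_genre(toy, 'Navigation'):
    print(name, ':', count)
```

With the real data, something like `top_apps_in_genre(ios_free, 'Navigation')` would show how much of a genre's average comes from its one or two largest apps.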


### Most popular apps by genre on Google Play

In [100]:
print('Android by Installs')
display_table(android_free, 5)

Android by Installs
1,000,000+ : 1394
100,000+ : 1024
10,000,000+ : 935
10,000+ : 903
1,000+ : 744
100+ : 613
5,000,000+ : 605
500,000+ : 493
50,000+ : 423
5,000+ : 400
10+ : 314
500+ : 288
50,000,000+ : 204
100,000,000+ : 189
50+ : 170
5+ : 70
1+ : 45
500,000,000+ : 24
1,000,000,000+ : 20
0+ : 4
0 : 1


In [101]:
freq_table_genre_android = freq_table(android_free, 1)

for category in freq_table_genre_android:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            len_category += 1
            installs = float(app[5].replace(',', '').replace('+', ''))
            total += installs
            
    avg_no_installs = total / len_category
    print('Genre: ', category, ', average number of installs: ', avg_no_installs)

Genre:  ART_AND_DESIGN , average number of installs:  2021626.7857142857
Genre:  AUTO_AND_VEHICLES , average number of installs:  647317.8170731707
Genre:  BEAUTY , average number of installs:  513151.88679245283
Genre:  BOOKS_AND_REFERENCE , average number of installs:  8767811.894736841
Genre:  BUSINESS , average number of installs:  1712290.1474201474
Genre:  COMICS , average number of installs:  817657.2727272727
Genre:  COMMUNICATION , average number of installs:  38456119.167247385
Genre:  DATING , average number of installs:  854028.8303030303
Genre:  EDUCATION , average number of installs:  1833495.145631068
Genre:  ENTERTAINMENT , average number of installs:  11640705.88235294
Genre:  EVENTS , average number of installs:  253542.22222222222
Genre:  FINANCE , average number of installs:  1387692.475609756
Genre:  FOOD_AND_DRINK , average number of installs:  1924897.7363636363
Genre:  HEALTH_AND_FITNESS , average number of installs:  4188821.9853479853
Genre:  HOUSE_AND_HOME , 

Categories that stand out by average number of installs (rounded):

- COMMUNICATION: 38M
- ENTERTAINMENT: 11M
- GAME: 15M
- SOCIAL: 23M
- PHOTOGRAPHY: 17M
- TRAVEL_AND_LOCAL: 13M
- PRODUCTIVITY: 16M
- VIDEO_PLAYERS: 24M (but skewed by YouTube)
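
Since a single giant can dominate a category average, one way to probe these figures is to recompute the average while leaving out apps above some install threshold. This is only a sketch under assumptions: the helpers `parse_installs` and `average_installs`, the 100 million cutoff, and the toy rows are all invented for illustration; the column positions (category at index 1, installs at index 5) follow the Google Play layout.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to a float."""
    return float(text.replace(',', '').replace('+', ''))

def average_installs(dataset, category, max_installs=None, category_index=1, installs_index=5):
    """Average installs in `category`, optionally skipping apps at or above `max_installs`."""
    values = []
    for row in dataset:
        if row[category_index] != category:
            continue
        installs = parse_installs(row[installs_index])
        if max_installs is not None and installs >= max_installs:
            continue
        values.append(installs)
    return sum(values) / len(values) if values else 0.0

# Invented rows: one giant and two small apps in the same category.
toy = [
    ['Giant Video', 'VIDEO_PLAYERS', '', '', '', '1,000,000,000+'],
    ['Small Player', 'VIDEO_PLAYERS', '', '', '', '10,000+'],
    ['Tiny Player', 'VIDEO_PLAYERS', '', '', '', '50,000+'],
]
print(average_installs(toy, 'VIDEO_PLAYERS'))
print(average_installs(toy, 'VIDEO_PLAYERS', max_installs=100_000_000))
```

Note that the install values are open-ended buckets ('10,000+' and so on), so any average built from them is a lower-bound estimate rather than an exact figure.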
