This project is about finding the profitable app profiles for the Apple Store and Google Play Store 

# Analyzing Mobile App Data Project

In [1]:
from csv import reader
app_store = open('AppleStore.csv', encoding="utf8")
goog_play = open('googleplaystore.csv', encoding="utf8")

In [2]:
app_store = reader(app_store)
app_store = list(app_store)
goog_play = reader(goog_play)
goog_play = list(goog_play)

# Opening and Exploring the Data

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
explore_data(app_store, 0, 3)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']




In [5]:
len(app_store[0])

16

In [6]:
len(app_store)

7198

# Deleting Wrong Data

we should check if all the rows has equal number of entries, We already know the main row has 13 items so we would find the lesser ones and delete them.

In [7]:
def error_finder(dataset):
    for idx, item in enumerate(dataset[1:]):
        if len(item) < len(dataset[0]):
            print(f'Number of rows for row {idx} is {len(item)}')
            print(idx, goog_play[idx])

In [8]:
error_finder(goog_play)

Number of rows for row 10472 is 12
10472 ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


In [9]:
len(goog_play[10473])

12

In [10]:
del goog_play[10473]

In [11]:
error_finder(goog_play)

The google play dataset has some duplicates in some rows so we need to find and delete them.

In [12]:
explore_data(goog_play, 0, 3)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




In [13]:
goog_play = goog_play[1:]
app_store = app_store[1:]

# Removing Duplicate Entries: Part One

In [14]:
def duplicate_finder(col, dataset):
    unique_apps = []
    duplicate_apps = []
    for row in dataset[1:]:
        entry = row[col]
        if entry in unique_apps:
            duplicate_apps.append(entry)
        else:
            unique_apps.append(entry)
    
    print(f'Some of duplicate apps are {duplicate_apps[:15]} and they are {len(duplicate_apps)}')
#     print(f'Number of unique apps of column {dataset[col]} are {len(unique_apps)}')
#     return duplicate_apps

In [15]:
# These are the first item in rows which are duplicates
duplicate_finder(0, goog_play)

Some of duplicate apps are ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software'] and they are 1181


In [16]:
for item in goog_play:
    if item[0] == 'Quick PDF Scanner + OCR FREE':
        print(item)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


# Removing Duplicate Entries: Part Two

We won't remove the duplicates randomly, the criterion that we're going to use is: \
The fourth item in each row corresponds to the number of comments for each app, so the higher number of comments the latest version the app is. Therefore for each duplicate we can filter out a row with the latest version of the information.

Now, we want to remove the duplicates

We would first make a dictionary of unique items and their corresponding number of reviews. That if an app name is already in the dictionary and their current number of reviews is less than their duplicated app name number of reviews then their number of reviews would be updated.\
\
For the if statement condition if we use it with else clause then if both the criterias are not met (i.e. the app name is not in the max_reviews dictionary and also the max_reviews dictionary is not less than that of current reading number of reviews) the app wont be added to the dictionary, Therefore, we need to place adding the new dicitonary item for new reviews just out of the if statement and that's it. \
\
So, in this way we would make a unique list of app names with their corresponding maximum number of reviews.

In [17]:
reviews_max = {}
for row in goog_play:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    reviews_max[name] = n_reviews
    
print(f'count of apps with their reviews at maximum available count are: {len(reviews_max)}')

count of apps with their reviews at maximum available count are: 9659


In [18]:
print('Lenght of dataset list without thos duplicates: ', len(goog_play) - 1181)
print('Lenght of apps with their reviews count at maximum available amount: ', len(reviews_max))

Lenght of dataset list without thos duplicates:  9659
Lenght of apps with their reviews count at maximum available amount:  9659


We would use the previous dictionary of unique items with the maximum number of reviews to make a clean list of such app names with all their information. \

We see that the count of apps with max reviews the reviews_max dataset and the count of cleaned list of apps with no duplicates the android_clean are the same.

In [19]:
android_clean = []
already_added = []
for row in goog_play:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
print(f'count of google not duplicate apps are: {len(android_clean)}')

count of google not duplicate apps are: 9659


# Removing Non-English Apps

Next we want to filter those app names that are not English

The previous criterion to drop non-English apps couldn't correctly identify them as some english apps use emoji's or trademark sign so if we used that we would lose useful data since many English apps will be invorrectly labeled as non-English.\
\
To minimize the data loss we would consider only removing apps that has three non-English characters

In [20]:
def Non_English_Char_dropper(word):
    counter = 0
    for char in word:
        if ord(char) > 127:
            counter += 1
    if counter > 3:
        return False
    else:
        return True

In [21]:
Non_English_Char_dropper('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

Now we would define a non-English app cleaner function and apply it to both of our datasets, the latest filtered google and the app store

In [22]:
def Non_English_app_cleaner(dataset, col_num):
    updated_list = []
    for row in dataset:
        name = row[col_num]
        if Non_English_Char_dropper(name):
            updated_list.append(row)
    return updated_list

In [23]:
android_clean_English = Non_English_app_cleaner(android_clean, 0)
print(f'count of google English apps are: {len(android_clean_English)}')

count of google English apps are: 9614


In [24]:
app_store_English = Non_English_app_cleaner(app_store, 1)
print(f'count of apple English apps are: {len(app_store_English)}')

count of apple English apps are: 6183


# Isolating the Free Apps

We're only looking for free apps too, so we would filter them here

In [25]:
def free_apps(dataset, col_num):
    free_apps = []
    for row in dataset:
        price = row[col_num]
        if price == '0' or price == '0.0':
            free_apps.append(row)
    return free_apps

In [26]:
android_clean_English_free = free_apps(android_clean_English, 7)
print(f'count of google free apps are: {len(android_clean_English_free)}')

count of google free apps are: 8864


In [27]:
app_store_English_free = free_apps(app_store_English, 4)
print(f'count of apple free apps are: {len(app_store_English_free)}')

count of apple free apps are: 3222


# Most Common Apps by Genre

Now, we need to find out what genre of apps are successful in both google play and apple store, so we can build our app in that category and be successful in money generation faster, therefore we would need to make a frequency table for each of stores to understand this.

Also, we can make a common_genre dictionary so for each genre we can see what are the number of apps in both of Android and Apple stores.

Also we can sort out the genre dictionaries based on number of apps genre

In [28]:
def freq_table(dataset, col_num):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        col_data = row[col_num]
        if col_data in table:
            table[col_data] += 1
        else: 
            table[col_data] = 1
    
    freq_table_dic = {}
# To get percentages for each genre uncomment these two lines
    for key, value in table.items():
        freq_table_dic[key] = (value/total) * 100
        
    return freq_table_dic

In [29]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse=True)
    for item in table_sorted:
        print(item[1], ' is ', item[0])

In [30]:
# prime_genre
apple_apps_prime_genre_sorted_percent = dict(sorted(freq_table(app_store_English_free, 11)\
                                .items(), key=lambda x:x[1], reverse=True))
print(f'the frequency table for the genre of apple store apps is as follows:')
for item in apple_apps_prime_genre_sorted_percent.items():
    print(f'{item}') 

the frequency table for the genre of apple store apps is as follows:
('Games', 58.16263190564867)
('Entertainment', 7.883302296710118)
('Photo & Video', 4.9658597144630665)
('Education', 3.662321539416512)
('Social Networking', 3.2898820608317814)
('Shopping', 2.60707635009311)
('Utilities', 2.5139664804469275)
('Sports', 2.1415270018621975)
('Music', 2.0484171322160147)
('Health & Fitness', 2.0173805090006205)
('Productivity', 1.7380509000620732)
('Lifestyle', 1.5828677839851024)
('News', 1.3345747982619491)
('Travel', 1.2414649286157666)
('Finance', 1.1173184357541899)
('Weather', 0.8690254500310366)
('Food & Drink', 0.8069522036002483)
('Reference', 0.5586592178770949)
('Business', 0.5276225946617008)
('Book', 0.4345127250155183)
('Navigation', 0.186219739292365)
('Medical', 0.186219739292365)
('Catalogs', 0.12414649286157665)


As a possibility here comes next context regarding some new intuitions, till we investigate it further

The list was sorted. So, the most common prime genre was Games with 58.16%, and the next one was Education with only 7%, and there were far less than that number of apps in the stores; that's a massive drop of number of apps. Accordingly, most of apps like games according to their prime genre were just wasting of time for the customers.

 Generally, most of apps were designed just for gaming and entertainment, according to the number of apps in this genre it showed gaming was a lot more important for most of Apple customers than anything else.

We can say gaming industry is booming, while it wouldn't mean that all apps in that genre have a large number of users, but it does mean that large number of users do pay attention to that specific genre unless otherwise wouldn't that many number of apps been made by the industry in such genre.

If we wanna compare common apps between both stores to compare them side by side

In [31]:
# category
android_apps_categories_sorted_percent = dict(sorted(freq_table(android_clean_English_free, 1)\
                                   .items(), key=lambda x:x[1], reverse=True))
print(f'the frequency table for the categories of google apps is as follows:')
for item in android_apps_categories_sorted_percent.items():
    print(f'{item}')

the frequency table for the categories of google apps is as follows:
('FAMILY', 19.223826714801444)
('GAME', 9.510379061371841)
('TOOLS', 8.461191335740072)
('BUSINESS', 4.580324909747293)
('LIFESTYLE', 3.9034296028880866)
('PRODUCTIVITY', 3.892148014440433)
('FINANCE', 3.7003610108303246)
('MEDICAL', 3.5424187725631766)
('SPORTS', 3.4183212996389893)
('PERSONALIZATION', 3.3167870036101084)
('COMMUNICATION', 3.2490974729241873)
('HEALTH_AND_FITNESS', 3.068592057761733)
('PHOTOGRAPHY', 2.944494584837545)
('NEWS_AND_MAGAZINES', 2.7978339350180503)
('SOCIAL', 2.6624548736462095)
('TRAVEL_AND_LOCAL', 2.33528880866426)
('SHOPPING', 2.2450361010830324)
('BOOKS_AND_REFERENCE', 2.1435018050541514)
('DATING', 1.861462093862816)
('VIDEO_PLAYERS', 1.782490974729242)
('MAPS_AND_NAVIGATION', 1.3989169675090252)
('FOOD_AND_DRINK', 1.2409747292418771)
('EDUCATION', 1.128158844765343)
('LIBRARIES_AND_DEMO', 0.9363718411552346)
('AUTO_AND_VEHICLES', 0.9250902527075812)
('ENTERTAINMENT', 0.8799638989169

It showed Android users had far more diverse taste in category of the apps that they use.

The most common category was family, game, tools, business, lifestyle and many others, that's interesting, If we consider apps on their cellphones as their main source of information and time usage. It showed most of Android users are professional people who are family oriented, care about themselves their family business and lifestyle, pay least attention to Events, they think they know enough about parenting, comics, art and design, and beauty. Also, if we consider the fact that the most of gamers are kids, It also showed that Android users are a lot more mature than Apple users.

In [32]:
android_apps_genre_sorted_percent = dict(sorted(freq_table(android_clean_English_free, 9)\
                                   .items(), key=lambda x:x[1], reverse=True))
print(f'the frequency table for the genre of google apps is as follows:')
for item in android_apps_genre_sorted_percent.items():
    print(item)

the frequency table for the genre of google apps is as follows:
('Tools', 8.449909747292418)
('Entertainment', 6.069494584837545)
('Education', 5.347472924187725)
('Business', 4.580324909747293)
('Lifestyle', 3.892148014440433)
('Productivity', 3.892148014440433)
('Finance', 3.7003610108303246)
('Medical', 3.5424187725631766)
('Sports', 3.463447653429603)
('Personalization', 3.3167870036101084)
('Communication', 3.2490974729241873)
('Action', 3.1024368231046933)
('Health & Fitness', 3.068592057761733)
('Photography', 2.944494584837545)
('News & Magazines', 2.7978339350180503)
('Social', 2.6624548736462095)
('Travel & Local', 2.3240072202166067)
('Shopping', 2.2450361010830324)
('Books & Reference', 2.1435018050541514)
('Simulation', 2.0419675090252705)
('Dating', 1.861462093862816)
('Arcade', 1.861462093862816)
('Video Players & Editors', 1.782490974729242)
('Casual', 1.7486462093862816)
('Maps & Navigation', 1.3989169675090252)
('Food & Drink', 1.2409747292418771)
('Puzzle', 1.1281588

If we compare genre of android apps and apple store apps we can see that android users are a lot less active in gaming, and it confirms our last intuition.

In [33]:
common_genres = {}
for key, value in android_apps_genre_sorted_percent.items():
    if key in apple_apps_prime_genre_sorted_percent:
        common_genres[key] = [value, apple_apps_prime_genre_sorted_percent[key]]
print(f'the frequency table for apps presenting in both of the stores,\
the first android store, and the second is apple store:\n\n {common_genres}')

the frequency table for apps presenting in both of the stores,the first android store, and the second is apple store:

 {'Entertainment': [6.069494584837545, 7.883302296710118], 'Education': [5.347472924187725, 3.662321539416512], 'Business': [4.580324909747293, 0.5276225946617008], 'Lifestyle': [3.892148014440433, 1.5828677839851024], 'Productivity': [3.892148014440433, 1.7380509000620732], 'Finance': [3.7003610108303246, 1.1173184357541899], 'Medical': [3.5424187725631766, 0.186219739292365], 'Sports': [3.463447653429603, 2.1415270018621975], 'Health & Fitness': [3.068592057761733, 2.0173805090006205], 'Shopping': [2.2450361010830324, 2.60707635009311], 'Food & Drink': [1.2409747292418771, 0.8069522036002483], 'Weather': [0.8009927797833934, 0.8690254500310366], 'Music': [0.2030685920577617, 2.0484171322160147]}


# Most Popular Apps by Genre on the App Store

We saw apps made for fun dominate the apple store, while Google play shows a more balanced 

We need to find out what genres are the most popular (i.e. have the most users)

One way to investigate is to calculate the average number of installs for each app genre. For google play dataset we have an Installs column which we can use. While we don't have it for the Apple store so we can use the total number of user ratings as a proxy, that is in the rating_count_tot column.

1-Let's Isolate the apps of each genre.

2-Add up the user ratings for the apps of that genre.

3-Divide the sum by the number of apps for that genre (NOT BY THE TOTAL NUMBER OF APPS)

The following provides similar results to the previous app store results just without percentages.

In [34]:

genre_user_counting = {}
for genre in apple_apps_prime_genre_sorted_percent:
    total = 0 # the sum of user ratings
    len_genre = 0 # the number of apps specific to each genre
    for app in app_store_English_free:
        genre_app = app[11]
        if genre_app == genre:
            user_ratings = float(app[5])
            total += user_ratings
            len_genre += 1
    ave_num_user_ratings = total / len_genre
    genre_user_counting[genre] = ave_num_user_ratings
    
    
genre_user_counting = dict(sorted(genre_user_counting.items(), key=lambda x:x[1], reverse=True))
print(f'the frequency table for the genre of google apps is as follows:')
for item in genre_user_counting.items():
    print(item)

the frequency table for the genre of google apps is as follows:
('Navigation', 86090.33333333333)
('Reference', 74942.11111111111)
('Social Networking', 71548.34905660378)
('Music', 57326.530303030304)
('Weather', 52279.892857142855)
('Book', 39758.5)
('Food & Drink', 33333.92307692308)
('Finance', 31467.944444444445)
('Photo & Video', 28441.54375)
('Travel', 28243.8)
('Shopping', 26919.690476190477)
('Health & Fitness', 23298.015384615384)
('Sports', 23008.898550724636)
('Games', 22788.6696905016)
('News', 21248.023255813954)
('Productivity', 21028.410714285714)
('Utilities', 18684.456790123455)
('Lifestyle', 16485.764705882353)
('Entertainment', 14029.830708661417)
('Business', 7491.117647058823)
('Education', 7003.983050847458)
('Catalogs', 4004.0)
('Medical', 612.0)


In [35]:
for row in app_store_English_free:
    if row[11] == 'Navigation':
        print(row[1], 'is', row[5])

Waze - GPS Navigation, Maps & Real-time Traffic is 345046
Google Maps - Navigation & Transit is 154911
Geocaching® is 12811
CoPilot GPS – Car Navigation & Offline Maps is 3582
ImmobilienScout24: Real Estate Search in Germany is 187
Railway Route Search is 5


Therefore, on average navigation apps have the highest number of user reviews but it's dominated by Waze and Google Maps both have a great participation in this genre with half a million users in total.

In [36]:
for row in app_store_English_free:
    if row[11] == 'Travel':
        print(row[1], 'is', row[5])

Google Earth is 446185
Yelp - Nearby Restaurants, Shopping & Services is 223885
GasBuddy is 145549
TripAdvisor Hotels Flights Restaurants is 56194
Uber is 49466
Lyft is 46922
HotelTonight - Great Deals on Last Minute Hotels is 32341
Hotels & Vacation Rentals by Booking.com is 31261
Southwest Airlines is 30552
Airbnb is 22302
Expedia Hotels, Flights & Vacation Package Deals is 10278
Fly Delta is 8094
Hopper - Predict, Watch & Book Flights is 6944
United Airlines is 5748
Skiplagged — Actually Cheap Flights & Hotels is 1851
Viator Tours & Activities is 1839
iExit Interstate Exit Guide is 1798
Gogo Entertainment is 1482
Google Street View is 1450
Webcams – EarthCam is 912
HISTORY Here is 685
DB Navigator is 512
Mobike - Dockless Bike Share is 494
MiFlight™ – Airport security line wait times at checkpoints for domestic and international travelers is 493
BlaBlaCar - Trusted Carpooling is 397
Six Flags is 353
Google Trips – Travel planner is 329
Voyages-sncf.com : book train and bus tickets i

As we see TripAdvisor is the best travel app with only 56k users, that's a great opportunity to make a travel app on apple store.

In [37]:
for row in app_store_English_free:
    if row[11] == 'Catalogs':
        print(row[1], 'is', row[5])

CPlus for Craigslist app - mobile classifieds is 13345
DRAGONS MODS FREE for Minecraft PC Game Edition is 2027
Face Swap and Copy Free – Switch & Fusion Faces in a Photo is 431
Ringtone Remixes - Marimba Remix Ringtones is 213


In [38]:
for row in app_store_English_free:
    if row[11] == 'Reference':
        print(row[1], 'is', row[5])

Bible is 985920
Dictionary.com Dictionary & Thesaurus is 200047
Dictionary.com Dictionary & Thesaurus for iPad is 54175
Google Translate is 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran is 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition is 17588
Merriam-Webster Dictionary is 16849
Night Sky is 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) is 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools is 4693
GUNS MODS for Minecraft PC Edition - Mods Tools is 1497
Guides for Pokémon GO - Pokemon GO News and Cheats is 826
WWDC is 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free is 718
VPN Express is 14
Real Bike Traffic Rider Virtual Reality Glasses is 8
教えて!goo is 0
Jishokun-Japanese English Dictionary & Translator is 0


In the last two blocks two app profile recommendation is provided for the app store.

It seems most people are interested in Craiglist-mobile classifieds to sell their stuff with 13345 popularity while the next choice on that genre only has 2027 popularity, so it would be great to make an app in genre of Catalogs.

Also, they were interested in Bible, Dictionary.com, and translations. Also, after Bible and Dictionary.com there is a large drop of popularity from 200k to only 54k

# Most Popular Apps by Genre on Google Play

Below, we can see the number of installs for the genres of the latest complete dataset for the google play store apps

In [39]:
display_table(android_clean_English_free, 5)

1,000,000+  is  15.749097472924186
100,000+  is  11.563628158844766
10,000,000+  is  10.503158844765343
10,000+  is  10.209837545126353
1,000+  is  8.393501805054152
100+  is  6.915613718411552
5,000,000+  is  6.825361010830325
500,000+  is  5.561823104693141
50,000+  is  4.7721119133574
5,000+  is  4.512635379061372
10+  is  3.5424187725631766
500+  is  3.2490974729241873
50,000,000+  is  2.3014440433213
100,000,000+  is  2.1322202166064983
50+  is  1.917870036101083
5+  is  0.78971119133574
1+  is  0.5076714801444043
500,000,000+  is  0.2707581227436823
1,000,000,000+  is  0.22563176895306858
0+  is  0.04512635379061372
0  is  0.01128158844765343


The provided data for number of installs has the information like 1,000,000+ and it is not clear what the exact count is. We don't need such info, we just need to find out what app genres attract more users.

In [40]:
categories = freq_table(android_clean_English_free, 1)
avg_num_installs_dic = {}

for category in categories:
    total = 0          # sum of installs specific to each genre
    len_category = 0   # number of apps specific to each genre
    avg_num_installs = 0
    for row in android_clean_English_free:
        category_app = row[1]
        if category_app == category:
            num_installs = row[5]
            num_installs = num_installs.replace('+', '')
            num_installs = num_installs.replace(',', '')
            total += float(num_installs)
            len_category += 1
    avg_num_installs = total / len_category
    avg_num_installs_dic[category] = avg_num_installs
avg_num_installs_dic_sorted = dict(sorted(avg_num_installs_dic.items(), key=lambda x:x[1], reverse=True))

In [41]:
for item in avg_num_installs_dic_sorted.items():
    print(item)

('COMMUNICATION', 38326063.197916664)
('VIDEO_PLAYERS', 24790074.17721519)
('SOCIAL', 23253652.127118643)
('PHOTOGRAPHY', 17840110.40229885)
('PRODUCTIVITY', 16772838.591304347)
('TRAVEL_AND_LOCAL', 13984077.710144928)
('GAME', 12914435.883748516)
('TOOLS', 10801391.298666667)
('NEWS_AND_MAGAZINES', 9549178.467741935)
('ENTERTAINMENT', 9146923.076923076)
('BOOKS_AND_REFERENCE', 8767811.894736841)
('SHOPPING', 7036877.311557789)
('PERSONALIZATION', 5201482.6122448975)
('FAMILY', 5180161.789906103)
('WEATHER', 5074486.197183099)
('SPORTS', 4274688.722772277)
('HEALTH_AND_FITNESS', 4167457.3602941176)
('MAPS_AND_NAVIGATION', 4056941.7741935486)
('ART_AND_DESIGN', 1986335.0877192982)
('FOOD_AND_DRINK', 1924897.7363636363)
('EDUCATION', 1768500.0)
('BUSINESS', 1704192.3399014778)
('LIFESTYLE', 1437816.2687861272)
('FINANCE', 1387692.475609756)
('HOUSE_AND_HOME', 1331540.5616438356)
('DATING', 854028.8303030303)
('COMICS', 817657.2727272727)
('AUTO_AND_VEHICLES', 647317.8170731707)
('LIBRARI

In [42]:
for row in android_clean_English_free:
    if row[1] == 'COMMUNICATION' and (row[5] == '1,000,000,000+' or 
                                     row[5] == '500,000,000+' or
                                     row[5] == '100,000,000+'):
        print(row[0], ' is ', row[5])

Messenger – Text and Video Chat for Free  is  1,000,000,000+
Gmail  is  1,000,000,000+
imo beta free calls and text  is  100,000,000+
imo free video calls and chat  is  500,000,000+
Android Messages  is  100,000,000+
Google Duo - High Quality Video Calls  is  500,000,000+
UC Browser - Fast Download Private & Secure  is  500,000,000+
Skype - free IM & video calls  is  1,000,000,000+
Who  is  100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji  is  100,000,000+
WhatsApp Messenger  is  1,000,000,000+
Google Chrome: Fast & Secure  is  1,000,000,000+
Firefox Browser fast & private  is  100,000,000+
Messenger Lite: Free Calls & Messages  is  100,000,000+
LINE: Free Calls & Messages  is  500,000,000+
Hangouts  is  1,000,000,000+
Kik  is  100,000,000+
KakaoTalk: Free Calls & Text  is  100,000,000+
Opera Mini - fast web browser  is  100,000,000+
Opera Browser: Fast and Secure  is  100,000,000+
Telegram  is  100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer  is  100,000,000+
UC 

On average, communication apps have the highest number of installs, apps like "Messenger - Text and Video Chat for Free", "Gmail", "Skype", "WhatsApp" have a great participation in this.

To find out how largely the most popular communication apps dominate the installs we can make a smaller dataset and calculate the average

In [43]:
less_than_100m = []
for row in android_clean_English_free:
    num_installs = row[5]
    num_installs = num_installs.replace('+', '')
    num_installs = num_installs.replace(',', '')
    num_installs = float(num_installs)
    if row[1] == 'COMMUNICATION' and num_installs < 100000000:
        less_than_100m.append(num_installs)
sum(less_than_100m)/len(less_than_100m)

3593510.3486590036

As we saw If we remove the communication apps that have over 100m installs the average number of installs would reduce ten times.

In [44]:
for row in android_clean_English_free:
    if row[1] == 'TRAVEL_AND_LOCAL' and (row[5] == '1,000,000,000+' or 
                                     row[5] == '500,000,000+' or
                                     row[5] == '100,000,000+' or
                                     row[5] == '50,000,000+'):
        print(row[0], ' is ', row[5])

trivago: Hotels & Travel  is  50,000,000+
Booking.com Travel Deals  is  100,000,000+
VZ Navigator  is  50,000,000+
2GIS: directory & navigator  is  50,000,000+
Google Street View  is  1,000,000,000+
Maps - Navigate & Explore  is  1,000,000,000+
TripAdvisor Hotels Flights Restaurants Attractions  is  100,000,000+
MAPS.ME – Offline Map and Travel Navigation  is  50,000,000+
Google Earth  is  100,000,000+


In Travel and Local industry there are only few apps with more than 100m installations which are specifically in the field of Travel category.

# Summary

We were supposed to study datasets of both google play and apple store to find out what profile of apps would be profitable for both of the markets.

As we saw there is a kind of large lack of total number of ratings of apps on Apple Store and the number of users on Google store for TRAVEL apps therefore TRAVEL related apps are the most opportunistic app to build for both Apple Store and Google Store.