# Free Mobile Apps Analysis
Guided Project: Profitable App Profiles for the App Store and Google Play Markets      

Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users. The apps we build are free and earn money from in-app ads, so the revenue for any given app is mostly influenced by the number of people who use it: the more users who see and engage with the ads, the better.

The main purpose for me (the developer) is to improve my Python and data analysis skills, and to use this project to start my portfolio.


In [3]:
# reading the App Store apps file and storing it as a list of rows (ds_apple)
from csv import reader
with open('AppleStore.csv', encoding='utf8') as file_apple:  # the file is closed once it has been read
    ds_apple = list(reader(file_apple))

In [4]:
# reading the Google Play apps file and storing it as a list of rows (ds_google)
with open('googleplaystore.csv', encoding='utf8') as file_google:  # the file is closed once it has been read
    ds_google = list(reader(file_google))

In [5]:
# Function to display the rows in a more readable format
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [6]:
# displaying the first 3 rows from Google
explore_data(ds_google, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [7]:
# displaying the first 3 rows from Apple
explore_data(ds_apple, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


Google selected columns to be used in our analysis:
    * App, Category, Rating, Reviews, Size, Installs, 
    * Type, Price, Content Rating, Genres, Last Updated

Apple selected columns to be used in our analysis:
    * track_name, size_bytes, price, rating_count_tot 
    * rating_count_ver, user_rating, cont_rating, prime_genre
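
The rest of the notebook refers to these columns by position (for example `row[11]` for `prime_genre`). As a minimal sketch, the positions can be looked up from the header row instead of being counted by hand. The helper name `column_indexes` is my own and is not used elsewhere in this notebook.

```python
# Header row copied from the App Store output shown earlier.
apple_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


def column_indexes(header, wanted):
    """Map each wanted column name to its position in the header row."""
    positions = {name: i for i, name in enumerate(header)}
    return {name: positions[name] for name in wanted}


apple_cols = column_indexes(
    apple_header, ['track_name', 'price', 'rating_count_tot', 'prime_genre'])
print(apple_cols)
```

With the real data, `column_indexes(ds_apple[0], [...])` would do the same lookup against the loaded header.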


In [8]:
# Checking the row with a missing second column (Category).
# Problem detected in the kaggle forum thread: 
# https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015
print(ds_google[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [9]:
# loop used to detect rows with missing columns, comparing the length of each row with the header.
header_length = len(ds_google[0])
for index, row in enumerate(ds_google):
    if len(row) != header_length:
        print(row)
        print(index)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10473


In [10]:
# removing the problematic row (another approach would be to fill in the missing value)
del ds_google[10473]

In [11]:
# row deleted
print(ds_google[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


In [12]:
# detect rows with missing columns, Apple
header_length = len(ds_apple[0])
for index, row in enumerate(ds_apple):
    if len(row) != header_length:
        print(row)
        print(index)

The row with missing columns was deleted from the Google data set, and we didn't find any malformed rows in the Apple data set.
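
The same length check applies to both stores, so it could be wrapped in small reusable helpers. This is only a sketch: the function names and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


def drop_malformed_rows(dataset):
    """Return a new list that keeps only rows as long as the header row."""
    expected = len(dataset[0])
    return [row for row in dataset if len(row) == expected]


# Tiny made-up data set, for illustration only.
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.2'],
    ['Broken App', '4.0'],  # one column short
]
print(find_malformed_rows(sample))
print(len(drop_malformed_rows(sample)))
```

Building a new list, rather than calling `del` on a fixed index, means that running the cell twice cannot remove a second, valid row by accident.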

In [13]:
# checking duplicate rows (apps that appear more than once under the same name)
def duplicate_apps(data, vendor='google'):
    dup_apps = []
    uni_apps = []
    
    if vendor == 'google':
        app_col = 0
    elif vendor == 'apple':
        app_col = 1
    else:  
        print('Error, unknown vendor.')
        return

    for app in data:
        name = app[app_col]
        if name in uni_apps:
            dup_apps.append(name)
        else:
            uni_apps.append(name)
    
    print('Duplicate apps count: ', len(dup_apps))
    print('\n')
    print('# Example of duplicate apps:')
    for x in dup_apps[:10]:
        print(x)



In [14]:
# Checking duplicate apps on ds_google
duplicate_apps(ds_google)

Duplicate apps count:  1181


# Example of duplicate apps:
Quick PDF Scanner + OCR FREE
Box
Google My Business
ZOOM Cloud Meetings
join.me - Simple Meetings
Box
Zenefits
Google Ads
Google My Business
Slack


In [15]:
def get_app_by_name(name, data, vendor='google'):
    # print the rows from the apps list that match an app name
    if vendor == 'google':
        app_col = 0
    elif vendor == 'apple':
        app_col = 1
    else:  
        print('Error, unknown vendor.')
        return
    
    for row in data:
        app = row[app_col]
        if name == app:
            print(row)
            

In [16]:
get_app_by_name('Slack', ds_google, 'google')

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


In the duplicate entries above, only the 4th column differs: it is the number of reviews.
We'll treat the row with the highest review count as the most recent one and remove the other rows.

In [17]:
# storing the max reviews number for each app in a dictionary
reviews_max = {}
for row in ds_google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

print(len(reviews_max))

    

9659


For each Google app, we check whether the row has the maximum number of reviews.
If it does, it is the most recent row for that app, so we add it to a clean list of apps without duplicates.

In [18]:
# for each app, check whether the row has the max number of reviews and, if so, add it to a clean list without duplicates
google_clean = []
already_added = []
for row in ds_google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        google_clean.append(row)
        already_added.append(name)
    

In [19]:
# Checking the clean list from google apps:
explore_data(google_clean, 0, 3, True)        


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


In [20]:
# Checking duplicate apps on ds_apple
duplicate_apps(ds_apple, 'apple')

Duplicate apps count:  2


# Example of duplicate apps:
Mannequin Challenge
VR Roller Coaster


In [21]:
get_app_by_name('Mannequin Challenge', ds_apple, 'apple')

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']


In [22]:
# Apple - storing the max reviews number for each app in a dictionary
reviews_max = {}
for row in ds_apple[1:]:
    name = row[1]
    n_reviews = float(row[5])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

print(len(reviews_max))
print(len(ds_apple))

7195
7198


In [23]:
# Apple - for each app, check whether the row has the max number of reviews and, if so, add it to a clean list without duplicates
apple_clean = []
already_added = []
for row in ds_apple[1:]:
    name = row[1]
    n_reviews = float(row[5])
    if n_reviews == reviews_max[name] and name not in already_added:
        apple_clean.append(row)
        already_added.append(name)

In [24]:
# apple_clean list contains only unique apps now
duplicate_apps(apple_clean, 'apple')
print(len(ds_apple))
print(len(apple_clean))

Duplicate apps count:  0


# Example of duplicate apps:
7198
7195


So far, we have removed the duplicate app entries for Apple and Google.
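
The two-step removal (first a dictionary of maximum review counts, then a second loop with an `already_added` list) could also be written as a single pass. The sketch below is an alternative, not the code used in this notebook. The name `keep_most_reviewed` is my own, the Slack rows are shortened from the output above, and the last row is made up.

```python
def keep_most_reviewed(rows, name_col, reviews_col):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_col]
        # strict '>' keeps the first row when several rows tie for the maximum
        if name not in best or float(row[reviews_col]) > float(best[name][reviews_col]):
            best[name] = row
    return list(best.values())


sample_rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Example App', 'TOOLS', '4.0', '120'],  # made-up row
]
deduped = keep_most_reviewed(sample_rows, 0, 3)
for row in deduped:
    print(row)
```

For the real data the calls would be `keep_most_reviewed(ds_google[1:], 0, 3)` and `keep_most_reviewed(ds_apple[1:], 1, 5)`. The rows come back in order of first appearance of each name, which may differ from the order produced by the two-step version.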

Next, we check for non-English characters in the app names.

In [25]:
# count the characters above ord 127 (outside ASCII); if the name has more than 3 of them it is treated as non-English and the function returns False, otherwise True
def is_english(p_string):
    no_eng_count = 0
    for c in p_string:
        if ord(c) > 127:
            no_eng_count += 1
        if no_eng_count > 3:
            return False
    return True    

In [26]:
# Testing the is_english function
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('Instachat 😜😜😜😜😜'))

True
False
True
True
False
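
The same rule (allow up to three characters outside the ASCII range) can be expressed more compactly. This is a sketch of an equivalent check rather than a replacement for `is_english`. The name `is_mostly_ascii` and the `max_non_ascii` parameter are my own.

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Return True if the name has at most `max_non_ascii` characters above ord 127."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


for text in ['Instagram',
             '爱奇艺PPS -《欢乐颂2》电视剧热播',
             'Docs To Go™ Free Office Suite',
             'Instachat 😜😜😜😜😜']:
    print(is_mostly_ascii(text))
```

Keep in mind that this is only a heuristic: an English name with several emoji is rejected, and a non-English name written entirely in ASCII letters is accepted.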


In [27]:
# create a new data set for Google apps with English names only
google_english = []
for row in google_clean:
    if is_english(row[0]):
        google_english.append(row)
            

In [28]:
print('google clean rows: ', len(google_clean))
print('google english rows: ', len(google_english))

google clean rows:  9659
google english rows:  9614


Of the 9659 Google apps, 9614 with English names remain.

In [29]:
# Checking Apple apps (English characters)
# create a new data set for Apple apps with English names only
apple_english = []
for row in apple_clean:
    if is_english(row[1]):
        apple_english.append(row)

In [30]:
print('Length apple_clean: ', len(apple_clean))
print('Length apple_english: ', len(apple_english))

Length apple_clean:  7195
Length apple_english:  6181


Of the 7195 Apple apps, 6181 with English names remain.

Getting only free apps

In [31]:
# new list with only free google apps
google_free = []
for row in google_english:
    if row[6] == 'Free' or row[6] == 'free':
        google_free.append(row)

In [32]:
print(len(google_english))
print(len(google_free))

9614
8863


Of the 9614 Google apps, 8863 free apps remain.

In [33]:
# list for only free Apple apps
apple_free = []
for row in apple_english:
    if row[4] == '0.0':
        apple_free.append(row)

In [34]:
print(len(apple_english))
print(len(apple_free))

6181
3220


For Apple, of the 6181 apps, 3220 free apps remain.
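
The Apple filter compares the price with the exact string `'0.0'`, so it would miss a free app whose price is written as `'0'` or `'0.00'`. A more tolerant check converts the price to a number first. This is a sketch only: `is_free` is a hypothetical helper, and the leading `$` on paid prices is an assumption about how the price column is formatted.

```python
def is_free(price_text):
    """Treat a price string as free when its numeric value is zero."""
    cleaned = price_text.strip().lstrip('$')  # assumption: paid prices may start with '$'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # unparseable values are not counted as free


for price in ['0', '0.0', '0.00', '$4.99', 'Everyone']:
    print(price, is_free(price))
```

Filtering with a numeric check could give slightly different counts than the ones reported above, so the totals would need to be checked again.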

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.



In [35]:
# checking the number of apps per category in Google Play
freq_google_category = {}

for row in google_free:
    cat = row[1]
    if cat in freq_google_category:
        freq_google_category[cat] += 1
    else:
        freq_google_category[cat] = 1

for i in freq_google_category:
    print(i, freq_google_category[i])    
    
    

WEATHER 71
COMMUNICATION 287
SPORTS 301
FINANCE 328
AUTO_AND_VEHICLES 82
PRODUCTIVITY 345
VIDEO_PLAYERS 159
SOCIAL 236
ART_AND_DESIGN 57
BUSINESS 407
PARENTING 58
LIBRARIES_AND_DEMO 83
PERSONALIZATION 294
BOOKS_AND_REFERENCE 190
SHOPPING 199
NEWS_AND_MAGAZINES 248
TOOLS 750
FOOD_AND_DRINK 110
PHOTOGRAPHY 261
MAPS_AND_NAVIGATION 124
FAMILY 1675
TRAVEL_AND_LOCAL 207
HOUSE_AND_HOME 73
EDUCATION 103
DATING 165
ENTERTAINMENT 85
BEAUTY 53
HEALTH_AND_FITNESS 273
GAME 862
COMICS 55
LIFESTYLE 346
MEDICAL 313
EVENTS 63


In [36]:
# return a frequency table, given a dataset and column index
def freq_table(dataset, index):
    table = {}  # named `table` so it does not shadow the function name

    for row in dataset:
        col = row[index]
        if col in table:
            table[col] += 1
        else:
            table[col] = 1
            
    return table


In [37]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [38]:
# google category frequency table
display_table(google_free, 1)

FAMILY : 1675
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


In [39]:
# Apple prime_genre frequency table
display_table(apple_free, 11)

Games : 1872
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4


In [40]:
# Google genre frequency table
display_table(google_free, 9)

Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

Analyze the frequency table you generated for the prime_genre column of the App Store data set.  

**What is the most common genre? What is the runner-up?**  
 Games : 1872, followed by Entertainment : 254  
      
**What other patterns do you see?**  
 After Games, the top 5 is completed by Entertainment, Photo & Video, Education and Social Networking.  
 The audience appears to be younger (kids, teenagers), using the apps for entertainment.
        
**What is the general impression: are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?**  
 Most apps are for entertainment.
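
To put numbers on that impression, the counts can be turned into percentages. The sketch below copies the counts from the `prime_genre` frequency table printed above. The helper name `to_percentages` is my own.

```python
# Counts copied from the App Store prime_genre frequency table above.
apple_genre_counts = {
    'Games': 1872, 'Entertainment': 254, 'Photo & Video': 160, 'Education': 118,
    'Social Networking': 106, 'Shopping': 84, 'Utilities': 81, 'Sports': 69,
    'Music': 66, 'Health & Fitness': 65, 'Productivity': 56, 'Lifestyle': 51,
    'News': 43, 'Travel': 40, 'Finance': 36, 'Weather': 28, 'Food & Drink': 26,
    'Reference': 18, 'Business': 17, 'Book': 14, 'Navigation': 6, 'Medical': 6,
    'Catalogs': 4,
}


def to_percentages(counts):
    """Convert a frequency table of counts into percentages of the total."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 2) for key, value in counts.items()}


shares = to_percentages(apple_genre_counts)
top_five = sorted(shares.items(), key=lambda item: item[1], reverse=True)[:5]
for genre, share in top_five:
    print(genre, ':', share)
```

On these counts, Games alone accounts for more than half of the free English App Store apps.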
      
**Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?**  
 A large number of apps in a genre does not by itself mean a large number of users, but game apps do appear to have many users. I would recommend apps like Fruit Ninja, Clear Vision and Minecraft, which have more than 5 million ratings.   

Analyze the frequency table you generated for the Category and Genres column of the Google Play data set.

**What are the most common genres?**  
The most common genre is Tools (749) and the most common category is Family (1675).

**What other patterns do you see?**  
The audience seems to be older (adults), using the phone for daily activities.
Top 5 genres: Tools : 749, Entertainment : 538, Education : 474, Business : 407, Productivity : 345

**Compare the patterns you see for the Google Play market with those you saw for the App Store market.** 
It seems that App Store users are younger and use apps for entertainment, while Google Play users are adults who use their phones for day-to-day tasks and family.

**Can you recommend an app profile based on what you found so far? Do the frequency tables you generated reveal the most frequent app genres or what genres have the most users?**  
Social media apps have the most users (Facebook, WhatsApp, Instagram), but I would recommend game apps such as Clash of Clans, Subway Surfers and Candy Crush.

**Popular Apps by Genre**

In [42]:
f_apple_pgenre = freq_table(apple_free, 11)


In [48]:
# Checking Apple genres: average number of user ratings per genre
for genre in f_apple_pgenre:
    total = 0
    len_genre = 0
    for row in apple_free:
        genre_app = row[11]
        if genre_app == genre:
            total += float(row[5]) # rating_count_tot (number of user ratings)
            len_genre += 1
    avg_n_ratings = round(total / len_genre)
    print(genre, ':', avg_n_ratings)
    

Health & Fitness : 23298
Medical : 612
News : 21248
Social Networking : 71548
Shopping : 26920
Catalogs : 4004
Entertainment : 14030
Sports : 23009
Education : 7004
Reference : 74942
Business : 7491
Photo & Video : 28442
Music : 57327
Utilities : 18684
Book : 39758
Food & Drink : 33334
Weather : 52280
Games : 22813
Travel : 28244
Navigation : 86090
Finance : 31468
Lifestyle : 16486
Productivity : 21028


Looking at the average number of ratings per genre in the App Store, Navigation is the prime genre with the most user ratings (Navigation : 86,090).
The most popular apps, in order: Waze, Google Maps and Geocaching.
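
A ranking like this can be produced by sorting the apps of one genre by their rating count. The sketch below assumes the App Store column positions used in this notebook (name in column 1, `rating_count_tot` in column 5, `prime_genre` in column 11). The function names and the sample rows are made up for illustration.

```python
def top_apps(dataset, genre, name_col=1, count_col=5, genre_col=11, limit=3):
    """Return the `limit` most-rated (name, rating_count) pairs for one genre."""
    matches = [(row[name_col], float(row[count_col]))
               for row in dataset if row[genre_col] == genre]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)[:limit]


def make_row(name, rating_count, genre):
    """Build a made-up 16-column row shaped like an App Store row."""
    row = [''] * 16
    row[1], row[5], row[11] = name, str(rating_count), genre
    return row


sample = [
    make_row('App A', 100, 'Navigation'),
    make_row('App B', 300, 'Navigation'),
    make_row('App C', 900, 'Games'),
    make_row('App D', 50, 'Navigation'),
]
print(top_apps(sample, 'Navigation', limit=2))
```

With the real data, `top_apps(apple_free, 'Navigation')` would list the three most-rated free navigation apps. A few very large apps can dominate a genre's average, so it is worth looking at this list next to the average.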

In [51]:
# Checking the average number of reviews per category in Google Play:
f_google_category = freq_table(google_free, 1)

for category in f_google_category:
    total = 0
    len_category = 0
    for row in google_free:
        category_app = row[1]
        if category_app == category:
            total += float(row[3]) # reviews
            len_category += 1
    avg_n_reviews = round(total / len_category)
    print(category, ':', avg_n_reviews)

WEATHER : 171251
COMMUNICATION : 995608
SPORTS : 116939
FINANCE : 38536
AUTO_AND_VEHICLES : 14140
PRODUCTIVITY : 160635
VIDEO_PLAYERS : 425350
SOCIAL : 965831
ART_AND_DESIGN : 24699
BUSINESS : 24240
PARENTING : 16379
LIBRARIES_AND_DEMO : 10926
PERSONALIZATION : 181122
BOOKS_AND_REFERENCE : 87995
SHOPPING : 223887
NEWS_AND_MAGAZINES : 93088
TOOLS : 305733
FOOD_AND_DRINK : 57479
PHOTOGRAPHY : 404081
MAPS_AND_NAVIGATION : 142860
FAMILY : 113211
TRAVEL_AND_LOCAL : 129484
HOUSE_AND_HOME : 26435
EDUCATION : 56293
DATING : 21953
ENTERTAINMENT : 301752
BEAUTY : 7476
HEALTH_AND_FITNESS : 78095
GAME : 683524
COMICS : 42586
LIFESTYLE : 33922
MEDICAL : 3730
EVENTS : 2556


Looking at the average number of reviews per category in Google Play, Communication is the category with the most user reviews (COMMUNICATION : 995,608).
WhatsApp is the most popular communication app, followed by Messenger, then UC Browser and BBM Free Calls. I would recommend UC Browser and BBM, since WhatsApp and Messenger are already well known.


In [52]:
# Checking the average number of installs per category in Google Play:
f_google_category = freq_table(google_free, 1)

for category in f_google_category:
    total = 0
    len_category = 0
    for row in google_free:
        category_app = row[1]
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+', '')
            installs = installs.replace(',', '')
            total += float(installs)
            len_category += 1
    avg_installs = round(total / len_category)
    print(category, ':', avg_installs)

WEATHER : 5074486
COMMUNICATION : 38456119
SPORTS : 3638640
FINANCE : 1387692
AUTO_AND_VEHICLES : 647318
PRODUCTIVITY : 16787331
VIDEO_PLAYERS : 24727872
SOCIAL : 23253652
ART_AND_DESIGN : 1986335
BUSINESS : 1712290
PARENTING : 542604
LIBRARIES_AND_DEMO : 638504
PERSONALIZATION : 5201483
BOOKS_AND_REFERENCE : 8767812
SHOPPING : 7036877
NEWS_AND_MAGAZINES : 9549178
TOOLS : 10801391
FOOD_AND_DRINK : 1924898
PHOTOGRAPHY : 17840110
MAPS_AND_NAVIGATION : 4056942
FAMILY : 3697848
TRAVEL_AND_LOCAL : 13984078
HOUSE_AND_HOME : 1331541
EDUCATION : 1833495
DATING : 854029
ENTERTAINMENT : 11640706
BEAUTY : 513152
HEALTH_AND_FITNESS : 4188822
GAME : 15588016
COMICS : 817657
LIFESTYLE : 1437816
MEDICAL : 120551
EVENTS : 253542


Communication is the most popular category on Google Play, with the highest average number of installs (38,456,119).
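
The loops above scan the whole data set once per category. The same averages could be computed in a single pass by accumulating totals and counts per category. This is a sketch with made-up sample rows. `parse_installs` and `average_by_category` are hypothetical helper names.

```python
def parse_installs(text):
    """Convert an Installs label such as '10,000+' to a number."""
    return float(text.replace('+', '').replace(',', ''))


def average_by_category(rows, category_col, value_col, convert=float):
    """Average a numeric column per category in one pass over the rows."""
    totals = {}
    counts = {}
    for row in rows:
        category = row[category_col]
        totals[category] = totals.get(category, 0) + convert(row[value_col])
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


# Made-up rows: name, category, installs.
sample = [
    ['a', 'TOOLS', '10,000+'],
    ['b', 'TOOLS', '500,000+'],
    ['c', 'GAME', '1,000+'],
]
averages = average_by_category(sample, 1, 2, convert=parse_installs)
print(averages)
```

For the real data the call would be `average_by_category(google_free, 1, 5, convert=parse_installs)`. Since labels such as `'10,000+'` are lower bounds on the install count, the resulting averages are approximate.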

Considering both Apple and Google, Communication / Social apps seem to be the most popular on both platforms, judging by the number of user ratings and the overall number of users.
The most popular apps in this category are WhatsApp and Messenger.