# P1. Analyzing Mobile Apps Data To Increase Revenue

In this project we will be analyzing data for a company that builds mobile apps for Android and iOS devices; these apps are available on Google Play and the App Store.

These apps are free to download and install, and their main source of income is in-app ads. Because ad revenue is correlated with the number of people who use an app, we aim to provide an analysis that helps developers understand which types of apps attract more users.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
from csv import reader

def open_file(name_file):
    # Use an explicit encoding and close the file automatically when done
    with open(name_file, encoding='utf8') as opened_file:
        apps_data = list(reader(opened_file))
    return apps_data

In [3]:
apple_store = open_file('AppleStore.csv')
google_store = open_file('googleplaystore.csv')

#explore_data(apple_store, 0, 5, True)
#explore_data(google_store, 0,5, False)

col_names_apple = apple_store[0]
col_names_google = google_store[0]
print(col_names_apple)
print(col_names_google)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


**Potential Significant Variables in our analysis**

| Column name      | Description                   |
|------------------|-------------------------------|
| rating_count_tot | total number of ratings       |
| user_rating      | user rating                   |
| cont_rating      | content rating                |
| prime_genre      | primary genre                 |
| lang.num         | number of supported languages |


**Data Cleaning**

In [4]:
# This row is missing its 'Category' value, so every later field is shifted
# one column to the left. We delete it as an incorrect data entry.
print(google_store[10473])
# only run the del statement once
del google_store[10473]


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
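Rather than hard-coding the row index, rows like this one can be located by comparing each row's length with the header's length. The sketch below is a minimal illustration on a small made-up dataset; `find_malformed_rows` is a hypothetical helper name and is not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]


# Tiny made-up dataset: the last row is missing its 'Category' value
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
]
malformed = find_malformed_rows(sample)
print(malformed)
```

Calling the same helper on `google_store` before the `del` statement should flag the row printed above, assuming it is the only row with a missing field.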


**Removing Duplicate Entries**

It appears that the Google dataset contains some duplicate rows. 

In [5]:
def duplicate_apps(dataset):
    # The app name is in column 0 for Google Play and column 1 for the App Store
    name_index = 0 if dataset[0][-1] == 'Android Ver' else 1
    unique_apps = set()
    duplicate_app_list = []
    for app in dataset[1:]:
        app_name = app[name_index]
        if app_name in unique_apps:
            duplicate_app_list.append(app_name)
        else:
            unique_apps.add(app_name)
    return duplicate_app_list

In [6]:
print(len(duplicate_apps(google_store)))

print('There are 1181 duplicate apps in Google Store')

print(len(duplicate_apps(apple_store)))

print('There are 2 duplicate apps in Apple Store')

test = (duplicate_apps(google_store))
print(test[:5])

1181
There are 1181 duplicate apps in Google Store
2
There are 2 duplicate apps in Apple Store
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


**Criterion to remove duplicates:**

Instead of removing duplicates at random, we will keep the entry with the highest number of reviews and remove the other duplicate entries.

In [7]:
for app in google_store[1:]:
    name = app[0]
    if name =='Google My Business':
        print(app)
    

['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']
['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']


In [8]:
#determine highest review count for each unique app
def max_google_reviews(dataset):
    reviews_max = {}
    for row in dataset[1:]:
        name = row[0]
        n_reviews = float(row[3])
        if name in reviews_max and reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
        elif name not in reviews_max:
            reviews_max[name] = n_reviews
    return reviews_max

reviews_max = max_google_reviews(google_store)

#remove duplicates and keep app with highest review count
def remove_google_dups(dataset, reviews_max):
    android_clean = []
    already_added = []
    for row in dataset[1:]:
        name = row[0]
        n_reviews = float(row[3])
        if n_reviews == reviews_max[name] and name not in already_added:
            android_clean.append(row)
            already_added.append(name)
    return android_clean, already_added

remove_dups = remove_google_dups(google_store, reviews_max)


In [9]:
#determine highest review count for each unique app
def max_apple_reviews(dataset):
    reviews_max = {}
    for row in dataset[1:]:
        name = row[1]
        n_reviews = float(row[5])
        if name in reviews_max and reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
        elif name not in reviews_max:
            reviews_max[name] = n_reviews
    return reviews_max

reviews_max_apple = max_apple_reviews(apple_store)

#remove duplicates and keep app with highest review count
def remove_apple_dups(dataset, reviews_max):
    iOS_clean = []
    already_added = []
    for row in dataset[1:]:
        name = row[1]
        n_reviews = float(row[5])
        if n_reviews == reviews_max[name] and name not in already_added:
            iOS_clean.append(row)
            already_added.append(name)
    return iOS_clean, already_added

remove_dups_apple = remove_apple_dups(apple_store, reviews_max_apple)

In [10]:
android_clean = remove_dups[0]
android_already_added = remove_dups[1]
print(len(android_already_added))

iOS_clean = remove_dups_apple[0]
iOS_already_added = remove_dups_apple[1]
print(str(len(iOS_clean)) + ' # of unique iOS apps')

9659
7195 # of unique iOS apps
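The Google and Apple de-duplication functions above differ only in which columns hold the app name and the review count, so the same idea could be expressed once with the column indices passed in. This is a sketch under that assumption; `remove_duplicates` is a hypothetical name, and unlike the original functions it returns rows in order of each name's first appearance.

```python
def remove_duplicates(rows, name_index, reviews_index):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows (header already removed): name in column 0, reviews in column 3
rows = [
    ['App A', 'TOOLS', '4.0', '10'],
    ['App B', 'GAME', '4.5', '7'],
    ['App A', 'TOOLS', '4.0', '25'],
    ['App A', 'TOOLS', '4.0', '25'],
]
result = remove_duplicates(rows, 0, 3)
print(result)
```

For the real data this would be called as `remove_duplicates(google_store[1:], 0, 3)` and `remove_duplicates(apple_store[1:], 1, 5)`, using the same indices as the functions above.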


**Removing non-English apps from both datasets**

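In this section we will remove non-English apps. An app name is treated as English if it contains at most three characters outside the ASCII range (code points above 127), which keeps English names that include a few emojis or symbols such as ™.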

In [11]:
def is_english(string):
    english = True
    non_english_char = 0
    for char in string:
        if ord(char)> 127:
            non_english_char += 1
            if non_english_char >3:
                english = False
    return english

#checking some examples with our function
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
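To see why the last two examples pass the check, it can help to count the characters that fall outside the ASCII range. The helper below is only an illustrative sketch; `count_non_ascii` is a hypothetical name that does not appear in the original notebook.

```python
def count_non_ascii(string):
    """Count characters whose code point is above 127 (outside ASCII)."""
    return sum(1 for char in string if ord(char) > 127)


for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, '->', count_non_ascii(name))
```

Each of these names has at most one non-ASCII character, well under the threshold of three, whereas a name written mostly in Chinese characters exceeds it quickly.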


In [12]:
english_iOS_clean = []
english_android_clean = []

for row in iOS_clean:
    name = row[1]
    if is_english(name):
        english_iOS_clean.append(row)


for row in android_clean:
    name = row[0]
    if is_english(name):
        english_android_clean.append(row)

print(len(english_iOS_clean))
print(len(english_android_clean))


6181
9614


**Isolating free Apps**

We will loop through the datasets to determine the apps that are free.

In [13]:
google_free = []
apple_free = []

for row in english_android_clean:
    price = row[7]
    if price == '0':
        google_free.append(row)
        
for row in english_iOS_clean:
    price = float(row[4])
    if price == 0:
        apple_free.append(row)

print(len(google_free))
print(len(apple_free))

print(google_free[1])
print(apple_free[1])


8864
3220
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


**Understanding Popular Genres for Android and iOS markets**

Because the end goal is to publish profitable apps on both the App Store and Google Play, we need to understand which kinds of apps are successful in both markets.

In [14]:
#building frequency tables
google_ft_cat = {}
google_ft_genre = {}
apple_ft = {}
for app in google_free:
    genre = app[9]
    if genre in google_ft_genre:
        google_ft_genre[genre] += 1
    else:
        google_ft_genre[genre] = 1

for app in google_free:
    cat = app[1]
    if cat in google_ft_cat:
        google_ft_cat[cat] += 1
    else:
        google_ft_cat[cat] = 1


for app in apple_free:
    genre = app[11]
    if genre in apple_ft:
        apple_ft[genre] += 1
    else:
        apple_ft[genre] = 1

print(google_ft_cat)
print(google_ft_genre)
print(apple_ft)

{'AUTO_AND_VEHICLES': 82, 'HEALTH_AND_FITNESS': 273, 'DATING': 165, 'COMICS': 55, 'TRAVEL_AND_LOCAL': 207, 'SHOPPING': 199, 'COMMUNICATION': 287, 'LIFESTYLE': 346, 'GAME': 862, 'TOOLS': 750, 'MEDICAL': 313, 'EVENTS': 63, 'ART_AND_DESIGN': 57, 'PRODUCTIVITY': 345, 'NEWS_AND_MAGAZINES': 248, 'BUSINESS': 407, 'FINANCE': 328, 'LIBRARIES_AND_DEMO': 83, 'EDUCATION': 103, 'SOCIAL': 236, 'FOOD_AND_DRINK': 110, 'PARENTING': 58, 'SPORTS': 301, 'PERSONALIZATION': 294, 'HOUSE_AND_HOME': 73, 'BEAUTY': 53, 'BOOKS_AND_REFERENCE': 190, 'ENTERTAINMENT': 85, 'PHOTOGRAPHY': 261, 'WEATHER': 71, 'FAMILY': 1676, 'MAPS_AND_NAVIGATION': 124, 'VIDEO_PLAYERS': 159}
{'Educational;Pretend Play': 8, 'Arcade;Action & Adventure': 11, 'Medical': 313, 'News & Magazines': 248, 'Health & Fitness;Education': 1, 'Video Players & Editors': 157, 'Comics': 54, 'Food & Drink': 110, 'Educational;Creativity': 3, 'Education;Brain Games': 3, 'Entertainment;Music & Video': 15, 'Parenting;Music & Video': 6, 'Board': 34, 'Entertainm

In [25]:
## creating a frequency table function

def freq_table(dataset, index):
    ft_dictionary = {}
    total_ct = len(dataset)
    for row in dataset:
        ft_col = row[index]
        if ft_col in ft_dictionary:
            ft_dictionary[ft_col] += 1
        else:
            ft_dictionary[ft_col] = 1
    for genre in ft_dictionary:
        ft_dictionary[genre] = 100*(ft_dictionary[genre]/total_ct)
        
    return ft_dictionary

## function to sort the values within the dictionary

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
display_table(google_free, 9)
display_table(google_free, 1)
display_table(apple_free, 11)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075
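The same percentage table can also be built with `collections.Counter` from the standard library, which handles the counting and the sorting. This is an alternative sketch rather than the notebook's own method; `freq_table_counter` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, sorted by descending share."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: 100 * count / total for value, count in counts.most_common()}


sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
table = freq_table_counter(sample, 1)
for genre, pct in table.items():
    print(genre, ':', pct)
```

Because dictionaries preserve insertion order in Python 3.7 and later, iterating over the result yields the most common value first.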

**Analyzing Frequency Tables**

In [31]:
display_table(apple_free, 11)

Games : 58.13664596273293
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


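The most common genre among free English apps in the App Store is 'Games' (about 58%), followed by 'Entertainment' (about 8%). With more than half of all apps in a single genre, games are a crowded and competitive market. Note that a large share of apps does not by itself show that a genre attracts the most users; it only shows where developers publish the most apps.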

In [33]:
display_table(google_free, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The most common genre among free English apps on Google Play is 'Tools' (about 8.4%), followed by 'Entertainment' (about 6.1%). The App Store has no 'Tools' genre, and arguably its closest equivalent, 'Utilities', accounts for only about 2.5% of apps there. Given how common tools are on Google Play, this kind of app could also do well in the App Store.

**Most Popular Apps by Genre on the App Store**

In [47]:
iOS_genre = freq_table(apple_free, 11)
iOS_genre

for genre in iOS_genre:
    total = 0
    len_genre = 0
    for app in apple_free:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre +=1
    avg_num_rating = total/len_genre
    print(genre, ':',avg_num_rating)

Medical : 612.0
Photo & Video : 28441.54375
Finance : 31467.944444444445
Catalogs : 4004.0
Business : 7491.117647058823
Education : 7003.983050847458
Sports : 23008.898550724636
Weather : 52279.892857142855
Social Networking : 71548.34905660378
Book : 39758.5
Utilities : 18684.456790123455
Travel : 28243.8
Productivity : 21028.410714285714
Music : 57326.530303030304
Games : 22812.92467948718
Lifestyle : 16485.764705882353
Shopping : 26919.690476190477
Reference : 74942.11111111111
Navigation : 86090.33333333333
News : 21248.023255813954
Health & Fitness : 23298.015384615384
Food & Drink : 33333.92307692308
Entertainment : 14029.830708661417


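Navigation apps have the highest average number of user ratings per app in the App Store (about 86,090). This average is based on only 6 apps (0.19% of the 3,220 free apps), so a few very popular apps may dominate it.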
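The averages above are printed in dictionary order, which makes the ranking hard to read. A single pass that accumulates totals and counts per genre and then sorts the result could look like the sketch below; `average_by_group` is a hypothetical helper and the sample rows are made up.

```python
def average_by_group(rows, group_index, value_index):
    """Return (group, average) pairs sorted from highest to lowest average."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    averages = [(group, totals[group] / counts[group]) for group in totals]
    return sorted(averages, key=lambda pair: pair[1], reverse=True)


sample = [['Games', '100'], ['Games', '300'], ['Navigation', '500']]
ranked = average_by_group(sample, 0, 1)
for genre, avg in ranked:
    print(genre, ':', avg)
```

For the App Store data the call would be `average_by_group(apple_free, 11, 5)`, using the genre and rating-count columns from the cell above.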

In [49]:
display_table(google_free, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


**Most Popular Apps by Genre on Google Play**

In [1]:
google_cat = freq_table(google_free, 1)

for cat in google_cat:
    total = 0
    len_category = 0
    for app in google_free:
        category_app = app[1]
        if category_app == cat:
            n_install = app[5]
            n_install = n_install.replace('+', '')
            n_install = n_install.replace(',', '')
            n_install = float(n_install)
            total += n_install
            len_category += 1
    avg_num_install = total/len_category
    print(cat, ':', avg_num_install)

NameError: name 'freq_table' is not defined

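The 'COMMUNICATION' category appears to have the highest average number of installs. The `NameError` above most likely means this cell was run in a fresh kernel (its counter is `In [1]`) before the cell that defines `freq_table`; rerunning the earlier cells first should resolve it.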
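The install counts in the Google Play data are open-ended buckets such as '1,000,000+', so they have to be cleaned before they can be averaged, and the result is a lower bound rather than an exact figure. The conversion used in the loop can be isolated in a small helper, sketched here under the hypothetical name `parse_installs`.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to a float lower bound."""
    return float(text.replace('+', '').replace(',', ''))


for bucket in ['1,000,000+', '500+', '0']:
    print(bucket, '->', parse_installs(bucket))
```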