# Guided Project: Profitable App Profiles for the App Store and Google Play Markets

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

In [1]:
import csv
import pprint

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
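# Import the two data sets (the with blocks close each file after reading)
with open("AppleStore.csv", encoding="utf8") as apple_file:
    apple_store = list(csv.reader(apple_file))
with open("googleplaystore.csv", encoding="utf8") as google_file:
    google_store = list(csv.reader(google_file))

# Show the first few rows of each data set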
print("Apple Store Data")
explore_data(apple_store,0,3)
print("Google Play Store Data")
explore_data(google_store,0,3)

Apple Store Data
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Google Play Store Data
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 

In [4]:
# Row 10473 of the Google Play data contains an error:
# its rating is above 5, the maximum possible rating
print("Error in rating value of {} for '{}' app.".format(google_store[10473][2],google_store[10473][0]))

Error in rating value of 19 for 'Life Made WI-Fi Touchscreen Photo Frame' app.


In [5]:
# delete bad row of data
del google_store[10473]
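
A row like this can also be found without checking ratings by hand. The sketch below compares each row's length with the header's, which catches rows where a missing field has shifted the columns. It assumes that is the cause of the bad rating, which the notebook does not verify. The helper name `find_malformed_rows` and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset, header=True):
    """Return (index, row) pairs whose length differs from the first row's."""
    expected = len(dataset[0])
    start = 1 if header else 0
    return [
        (i, row)
        for i, row in enumerate(dataset)
        if i >= start and len(row) != expected
    ]

# Made-up sample: the last row is missing its category field
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Shifted App', '19'],
]
malformed = find_malformed_rows(sample)
print(malformed)
```

Running the same helper on `google_store` before the `del` statement would show whether row 10473 is shorter than the header.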

## Removing duplicate app data rows

In [6]:
def unique(list_of_list, index, header=True, print_output=True):
    """
    Splits the values found at `index` in a list of lists into
    duplicates (repeat occurrences) and unique values (first occurrences).
    Set header=False when the first row is data rather than a header.
    """
    
    duplicate = []
    unique = []
    
    # Skip the first row only when it is a header
    rows = list_of_list[1:] if header else list_of_list
    
    for row in rows:
        name = row[index]
        if name in unique:
            duplicate.append(name)
        else:
            unique.append(name)
    
    if print_output:
        print("Duplicate: " ,duplicate)
        print("   Unique: " ,unique)
    
    return (duplicate, unique)

In [7]:
duplicate_google, unique_google = unique(google_store, 0, print_output=False)
print("Number of duplicate apps in the google store data: ",len(duplicate_google))
print(duplicate_google[:5])

Number of duplicate apps in the google store data:  1181
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


As an example, here are the duplicate rows for Instagram in the Google Play data.

In [8]:
for app in google_store:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


The duplicate rows differ only in the number of reviews (column index 3), so for each app we will keep the row with the highest review count, which should be the most recent.

In [9]:
def find_high_value(list_of_lists, index_1, index_2):
    """
    Creates a dict that maps each unique index_1 value to the
    highest index_2 value found for it
    """
    
    duplicate_dict = {}
    
    # record a value when the name is new or the value is at least as high
    for row in list_of_lists:
        name, value = row[index_1], int(row[index_2])
        if (name not in duplicate_dict) or (value >= duplicate_dict[name]):
            duplicate_dict[name] = value
            
    return duplicate_dict

In [10]:
def remove_duplicates(list_of_lists, index_1, index_2):
    """
    Removes the duplicate values from a list of lists based on index_1 and 
    keeps the higher value of the index_2
    """
    
    filtered_list_of_lists = []
    already_added = []
    
    duplicate_dict = find_high_value(list_of_lists, index_1, index_2)
    
    # compose new list_of_lists
    for row in list_of_lists:
        name, value = row[index_1], int(row[index_2])
        if name in duplicate_dict:
            if (value == duplicate_dict[name]) and (name not in already_added):
                filtered_list_of_lists.append(row)
                already_added.append(name)
            
    return filtered_list_of_lists

In [11]:
google_store_dict = find_high_value(google_store[1:],0,3)
google_store_no_duplicates = remove_duplicates(google_store[1:],0,3)

In [12]:
print("Starting length of google store app data: {}".format(len(google_store)))
print("      Number of unique names in data set: {}".format(len(google_store_dict)))
print("                 Length of filtered data: {}".format((len(google_store_no_duplicates))))

Starting length of google store app data: 10841
      Number of unique names in data set: 9659
                 Length of filtered data: 9659
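
The two helpers above can also be merged into a single pass by storing the best row itself, rather than only its review count. This is a minimal sketch of that alternative, not the notebook's method. The name `dedupe_keep_max` is hypothetical, the Instagram rows are shortened from the output above, and the Box row's values are made up.

```python
def dedupe_keep_max(rows, key_index, value_index):
    """Keep, for each key, the first row holding the largest integer value."""
    best = {}
    for row in rows:
        key = row[key_index]
        value = int(row[value_index])
        # Replace the stored row only when this one has a strictly higher value
        if key not in best or value > int(best[key][value_index]):
            best[key] = row
    return list(best.values())

sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],   # made-up values
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
deduped = dedupe_keep_max(sample_rows, 0, 3)
for row in deduped:
    print(row)
```

One difference to keep in mind: this version returns rows in the order each app name first appears, whereas `remove_duplicates` returns them in the order of the rows it keeps.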


## Removing Non English Apps

We'll use each character's code point (as returned by Python's ord function) to tell whether an app name contains characters that are not part of English text. ASCII covers code points 0 to 127, so characters above 127 are generally not found in English titles. Some English titles do include a trademark symbol or an emoji, however, so we will remove a row only when its name contains more than three characters above 127.

In [13]:
def english(string):
    """
    Return True if at most three characters in the string have a
    code point above 127 (outside the ASCII range), otherwise False
    """
    count = 0
    
    for char in string:
        ascii_value = ord(char)
        if ascii_value > 127:
            count +=1
    
    if count < 4:
        return True
            
    return False

In [14]:
tests = ['Instagram','爱奇艺PPS -《欢乐颂2》电视剧热播','Docs To Go™ Free Office Suite','Instachat 😜']
for row in tests:
    print(english(row))

True
False
True
True
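
To see why the cutoff is three rather than zero, it helps to count the non-ASCII characters in each test name. The sketch below does that with a hypothetical helper, `non_ascii_count`, which is not part of the notebook. It reuses the four test strings from the cell above.

```python
def non_ascii_count(text):
    """Count characters whose code point lies outside the ASCII range 0-127."""
    return sum(1 for char in text if ord(char) > 127)

names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
counts = {name: non_ascii_count(name) for name in names}
for name, count in counts.items():
    print(count, name)
```

The trademark symbol and the emoji each count as a single character, so those two English titles stay under the limit, while the Chinese title exceeds it.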


Remove any non-English apps from both data sets.

In [15]:
google_store_english = []
google_store_non_english = []
apple_store_english = []
apple_store_non_english = []

for row in google_store_no_duplicates:
    name = row[0]
    if english(name):
        google_store_english.append(row)
    else:
        google_store_non_english.append(row)
        
for row in apple_store:
    name = row[1]
    if english(name):
        apple_store_english.append(row)
    else:
        apple_store_non_english.append(row)

In [16]:
print("Removed {} rows from google store data".format(len(google_store_non_english)))
print("Removed {} rows from apple store data".format(len(apple_store_non_english)))

Removed 45 rows from google store data
Removed 1014 rows from apple store data


## Keep only free apps of those left

In [17]:
apple_pay_types = []
google_pay_types = []

for row in apple_store_english:
    price = row[4]
    if price not in apple_pay_types:
        apple_pay_types.append(price)


for row in google_store_english:
    price = row[6]
    if price not in google_pay_types:
        google_pay_types.append(price)

print(" Types of pay types in apple store: {}\n".format(apple_pay_types))
print("Types of pay types in google store: {}".format(google_pay_types))

 Types of pay types in apple store: ['price', '0.0', '1.99', '0.99', '6.99', '2.99', '7.99', '4.99', '9.99', '3.99', '8.99', '5.99', '14.99', '13.99', '19.99', '17.99', '15.99', '24.99', '20.99', '29.99', '12.99', '39.99', '74.99', '16.99', '249.99', '11.99', '27.99', '49.99', '59.99', '22.99', '18.99', '99.99', '21.99', '34.99', '299.99']

Types of pay types in google store: ['Free', 'Paid', 'NaN']


To keep only the free apps from each store, I will keep rows whose price is '0.0' in the App Store data and rows marked as 'Free' in the Google Play data.

In [18]:
apple_free = []
google_free = []

for row in apple_store_english:
    price = row[4]
    if price == "0.0":
        apple_free.append(row)

for row in google_store_english:
    price = row[6]
    if price == "Free":
        google_free.append(row)

In [19]:
print(" Number of apple apps: {}".format(len(apple_free)))
print("Number of google apps: {}".format(len(google_free)))

 Number of apple apps: 3222
Number of google apps: 8863


## Most Common Genres In Free Apps From Google and Apple Stores in English

With the data cleaned, we will look at the most common genres in each app store.

In [20]:
def freq_table(list_of_list, index):
    
    dict_values = {}
    
    for row in list_of_list:
        genre = row[index]
        if genre in dict_values:
            dict_values[genre] += 1
        else:
            dict_values[genre] = 1
            
    return dict_values

google_free_genre_frq = freq_table(google_free, 1)
apple_free_genre_frg = freq_table(apple_free, 11)

In [21]:
def tuple_count(list_of_tuples,index):
    count = 0

    for t in list_of_tuples:
        count += t[index]
    
    return count

In [22]:
def count_tuple(list_of_tuple,index):
    count = 0

    for item in list_of_tuple:
        count += item[index]  # use the index argument instead of a hardcoded 1
    
    return count

In [23]:

ordered_google = sorted(google_free_genre_frq.items(), key = lambda kv:(kv[1], kv[0]), reverse=True)
ordered_apple = sorted(apple_free_genre_frg.items(), key = lambda kv:(kv[1], kv[0]), reverse=True)

# Add each genre's share of the total count as a third tuple element
google_count = tuple_count(ordered_google,1)
ordered_google = [(x[0],x[1],round(x[1]/google_count,4)) for x in ordered_google]
apple_count = tuple_count(ordered_apple,1)
ordered_apple = [(x[0],x[1],round(x[1]/apple_count,4)) for x in ordered_apple]

print("Google Genres By Count")
pprint.pprint(ordered_google)
print("\nApple Genres By Count")
pprint.pprint(ordered_apple)

Google Genres By Count
[('FAMILY', 1675, 0.189),
 ('GAME', 862, 0.0973),
 ('TOOLS', 750, 0.0846),
 ('BUSINESS', 407, 0.0459),
 ('LIFESTYLE', 346, 0.039),
 ('PRODUCTIVITY', 345, 0.0389),
 ('FINANCE', 328, 0.037),
 ('MEDICAL', 313, 0.0353),
 ('SPORTS', 301, 0.034),
 ('PERSONALIZATION', 294, 0.0332),
 ('COMMUNICATION', 287, 0.0324),
 ('HEALTH_AND_FITNESS', 273, 0.0308),
 ('PHOTOGRAPHY', 261, 0.0294),
 ('NEWS_AND_MAGAZINES', 248, 0.028),
 ('SOCIAL', 236, 0.0266),
 ('TRAVEL_AND_LOCAL', 207, 0.0234),
 ('SHOPPING', 199, 0.0225),
 ('BOOKS_AND_REFERENCE', 190, 0.0214),
 ('DATING', 165, 0.0186),
 ('VIDEO_PLAYERS', 159, 0.0179),
 ('MAPS_AND_NAVIGATION', 124, 0.014),
 ('FOOD_AND_DRINK', 110, 0.0124),
 ('EDUCATION', 103, 0.0116),
 ('ENTERTAINMENT', 85, 0.0096),
 ('LIBRARIES_AND_DEMO', 83, 0.0094),
 ('AUTO_AND_VEHICLES', 82, 0.0093),
 ('HOUSE_AND_HOME', 73, 0.0082),
 ('WEATHER', 71, 0.008),
 ('EVENTS', 63, 0.0071),
 ('PARENTING', 58, 0.0065),
 ('ART_AND_DESIGN', 57, 0.0064),
 ('COMICS', 55, 0.0062),

In the Google Play data sorted by genre count, the top three genres are Family (18.9%), Game (9.7%), and Tools (8.5%). Family has almost twice as many apps as the second-place genre.

The App Store data sorted by genre count looks very different. The largest category is Games, with a very large 58.2%. Entertainment (7.9%) and Photo & Video (4.9%) come in a distant second and third.

From the observations above, a Family app with a game aspect would fit the current mix of apps on Google Play, while an app centered on gameplay and entertainment would be more at home on the App Store.
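
The genre shares discussed above come from a frequency table divided by its total. As a compact alternative to `freq_table` plus `tuple_count`, the standard library's `collections.Counter` can produce the same kind of table. This is a sketch under assumptions: `genre_shares` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter

def genre_shares(rows, index):
    """Return (genre, count, share) tuples, most frequent genre first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [
        (genre, n, round(n / total, 4))
        for genre, n in counts.most_common()
    ]

# Made-up sample rows: [app name, genre]
sample = [
    ['App A', 'GAME'],
    ['App B', 'FAMILY'],
    ['App C', 'FAMILY'],
    ['App D', 'TOOLS'],
]
shares = genre_shares(sample, 1)
for entry in shares:
    print(entry)
```

Note that `most_common` keeps tied genres in first-seen order, while the notebook's `sorted` call breaks ties by genre name in reverse, so the ordering of equal counts can differ.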

## Most Users in Genres In Free Apps From Google and Apple Stores in English

Next we will look at how many people use each type of app. One way to find out which genres are the most popular (have the most users) is to add up the number of installs for each app genre. For the Google Play data set, we can find this information in the <code>Installs</code> column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the <code>rating_count_tot</code> column.
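
The install counts are stored as strings such as '10,000+', which mark a lower bound rather than an exact figure, so they need to be converted to integers before any arithmetic. The sketch below shows that conversion together with a per-app average for each genre, which complements a per-genre total because it is not inflated by genres that simply contain many apps. The helper names `parse_installs` and `average_by_genre` are hypothetical, and the sample rows are made up.

```python
def parse_installs(text):
    """Turn a string such as '10,000+' into the integer 10000."""
    return int(text.replace('+', '').replace(',', ''))

def average_by_genre(rows, genre_index, count_index):
    """Return a dict mapping each genre to its average count per app."""
    totals = {}
    apps = {}
    for row in rows:
        genre = row[genre_index]
        totals[genre] = totals.get(genre, 0) + parse_installs(row[count_index])
        apps[genre] = apps.get(genre, 0) + 1
    return {genre: totals[genre] / apps[genre] for genre in totals}

# Made-up sample rows: [app name, genre, installs]
sample = [
    ['App A', 'ART_AND_DESIGN', '10,000+'],
    ['App B', 'ART_AND_DESIGN', '500,000+'],
    ['App C', 'SOCIAL', '1,000,000,000+'],
]
averages = average_by_genre(sample, 1, 2)
for genre, average in averages.items():
    print(genre, average)
```

The same function works for the App Store data, since `parse_installs` leaves plain digit strings such as those in <code>rating_count_tot</code> unchanged.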

In [24]:
def count_users_genre(list_of_list,genre_index,count_index):
    dict_values = {}

    # Sum the counts per genre
    for row in list_of_list:
        genre = row[genre_index]
        count = int(row[count_index].replace('+','').replace(',',''))
        if genre in dict_values:
            dict_values[genre] += count
        else:
            dict_values[genre] = count

    # Sort and Order
    dict_values = sorted(dict_values.items(), key = lambda kv:(kv[1], kv[0]), reverse=True)
    return dict_values

# Google Count and Freq
google_free_genre_users_frq = count_users_genre(google_free, 1, 5)
google_count = tuple_count(google_free_genre_users_frq,1)
google_free_genre_users_frq = [(x[0],x[1],round(x[1]/google_count,4)) for x in google_free_genre_users_frq]

# Apple Count and Freq
apple_free_genre_users_frq = count_users_genre(apple_free, 11, 5)
apple_count = tuple_count(apple_free_genre_users_frq,1)
apple_free_genre_users_frq = [(x[0],x[1],round(x[1]/apple_count,4)) for x in apple_free_genre_users_frq]

print("Google Genre by Downloads")
pprint.pprint(google_free_genre_users_frq)

print("\nApple Genre by Rating Count")
pprint.pprint(apple_free_genre_users_frq)

Google Genre by Downloads
[('GAME', 13436869450, 0.1786),
 ('COMMUNICATION', 11036906201, 0.1467),
 ('TOOLS', 8101043474, 0.1077),
 ('FAMILY', 6193895690, 0.0823),
 ('PRODUCTIVITY', 5791629314, 0.077),
 ('SOCIAL', 5487861902, 0.0729),
 ('PHOTOGRAPHY', 4656268815, 0.0619),
 ('VIDEO_PLAYERS', 3931731720, 0.0522),
 ('TRAVEL_AND_LOCAL', 2894704086, 0.0385),
 ('NEWS_AND_MAGAZINES', 2368196260, 0.0315),
 ('BOOKS_AND_REFERENCE', 1665884260, 0.0221),
 ('PERSONALIZATION', 1529235888, 0.0203),
 ('SHOPPING', 1400338585, 0.0186),
 ('HEALTH_AND_FITNESS', 1143548402, 0.0152),
 ('SPORTS', 1095230683, 0.0146),
 ('ENTERTAINMENT', 989460000, 0.0131),
 ('BUSINESS', 696902090, 0.0093),
 ('MAPS_AND_NAVIGATION', 503060780, 0.0067),
 ('LIFESTYLE', 497484429, 0.0066),
 ('FINANCE', 455163132, 0.006),
 ('WEATHER', 360288520, 0.0048),
 ('FOOD_AND_DRINK', 211738751, 0.0028),
 ('EDUCATION', 188850000, 0.0025),
 ('DATING', 140914757, 0.0019),
 ('ART_AND_DESIGN', 113221100, 0.0015),
 ('HOUSE_AND_HOME', 97202461, 0.0

In the Google Play data, free English apps account for about 75 billion installs in total. The top three genres by installs are Game (17.9%), Communication (14.7%), and Tools (10.8%). The surprise here is Communication, whose share of installs (14.7%) is far larger than its share of apps (3.2%). This shows that Communication apps are downloaded much more often than their number on the store would suggest.

In the App Store data, a total of 78 million ratings were recorded. The top three genres are Games (53.4%), Social Networking (9.5%), and Photo & Video (5.7%). Social Networking is a new category in the top three.

The overall trend seen in the genre counts also holds for the number of users, except that Google Play users also download many apps for their usefulness, such as Communication and Tools.
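
One way to make the comparison between app counts and installs concrete is to divide each genre's share of installs by its share of apps. The sketch below does this for four Google Play genres, using the shares printed in the two output tables above. The ratio itself is an illustrative measure, not something the notebook computes.

```python
# Shares copied from the "Google Genres By Count" output
app_share = {
    'GAME': 0.0973,
    'COMMUNICATION': 0.0324,
    'TOOLS': 0.0846,
    'FAMILY': 0.189,
}
# Shares copied from the "Google Genre by Downloads" output
install_share = {
    'GAME': 0.1786,
    'COMMUNICATION': 0.1467,
    'TOOLS': 0.1077,
    'FAMILY': 0.0823,
}

# A ratio above 1 means the genre draws more installs than its app count suggests
ratios = {
    genre: round(install_share[genre] / app_share[genre], 2)
    for genre in app_share
}
for genre, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ratio)
```

Communication comes out well above the other three genres, while Family falls below 1, meaning it has many apps relative to the installs they attract.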