# Profitable App Profiles for the App Store and Google Play Markets

#### Author: Xiaofan Hu
#### Date: 04-16-2020

The goal of this project is to analyze data to help our developers understand what types of apps are likely to attract more users. I will analyze two of the most popular app stores: the App Store and Google Play.

The learning goals of this project are:
- The basics of programming in Python (arithmetical operations, variables, common data types, etc.)
- Lists and for loops
- Conditional statements
- Dictionaries and frequency tables
- Functions
- Jupyter Notebook

## Project Introduction
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.
![title](py1m8_statista.png)

The Google Play data set contains data about approximately 10,000 Android apps; the data was collected in August 2018.
The App Store data set contains data about approximately 7,000 iOS apps; the data was collected in July 2017.

## Explore the data

We can read the data using the `reader` function from the `csv` module:

In [22]:
from csv import reader

### The Apple Store data set ###
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
apple = list(read_file)
apple_header = apple[0]
apple = apple[1:]

### The Google data set ###
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
google = list(read_file)
google_header = google[0]
google = google[1:]
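
The two loading blocks above repeat the same steps and leave the files open. As a minimal sketch (the helper name `load_csv` and the in-memory sample are assumptions for illustration, not part of the original notebook), the steps could be wrapped in a function:

```python
import csv
import io

def load_csv(file_obj):
    """Return (header, rows) from an open text file containing CSV data."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

# In-memory sample so the sketch runs without the real files.
sample_file = io.StringIO("App,Category\nInstagram,SOCIAL\n")
header, rows = load_csv(sample_file)
print(header)  # ['App', 'Category']
print(rows)    # [['Instagram', 'SOCIAL']]

# With the real files, a `with` block closes the file automatically:
# with open('AppleStore.csv', encoding='utf8', newline='') as f:
#     apple_header, apple = load_csv(f)
```

The result is the same header/rows split as in the cell above; only the file handling differs.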

Before processing the data, we need to explore it to see what it looks like. We first print the column headers, and then define a function called *explore_data* that prints a slice of rows.

In [40]:
print('Apple Store Headers: ', apple_header)
print('Google Play Headers: ', google_header)

Apple Store Headers:  ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
Google Play Headers:  ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [23]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Explore the Apple Store Data

In [24]:
explore_data(apple, 0, 5, rows_and_columns=True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


Explore the Google Play Data

In [25]:
explore_data(google, 0, 5, rows_and_columns=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


Check whether any row is missing values, that is, whether it has a different number of columns than the first row.

In [26]:
def check_rows(data_list):
    n_row0 = len(data_list[0])
    i = 0
    for row in data_list:
        n_row = len(row)
        i += 1
        if n_row != n_row0:
            print(i-1, row)
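
A variant of this check that returns the offending rows instead of printing them can be easier to reuse. The sketch below is an illustration only; `find_bad_rows` and the sample rows are hypothetical names and data, not part of the notebook.

```python
def find_bad_rows(data_list, expected_len):
    """Return (index, row) pairs whose length differs from expected_len."""
    return [(i, row) for i, row in enumerate(data_list) if len(row) != expected_len]

# Made-up rows: the second one is missing a column.
sample = [['a', 'X', '1'], ['b', '2'], ['c', 'Y', '3']]
bad = find_bad_rows(sample, 3)
print(bad)  # [(1, ['b', '2'])]
```

Passing `len(google_header)` as `expected_len` would compare each row against the header rather than against the first data row.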

In [27]:
check_rows(apple)

In [28]:
check_rows(google)

10472 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


We found that in the Google Play data, the row at index 10472 is missing its "Category" value, which shifts the remaining columns to the left. We need to remove that row (alternatively, the missing value could be filled in based on an assumption).

In [29]:
del google[10472]
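
Deleting by index is not idempotent: if the cell is run a second time, it removes whichever row now sits at index 10472. One hedged alternative, sketched here with a hypothetical helper `drop_malformed` and made-up sample rows, is to filter by row length instead:

```python
def drop_malformed(rows, expected_len):
    """Return only the rows that have exactly expected_len columns."""
    return [row for row in rows if len(row) == expected_len]

# Made-up rows: the second one is missing a column.
sample = [['a', 'X', '1'], ['b', '2'], ['c', 'Y', '3']]
clean = drop_malformed(sample, 3)
print(len(clean))  # 2

# Applying the filter again changes nothing.
print(drop_malformed(clean, 3) == clean)  # True
```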

Check again to make sure every row now has the same number of columns.

In [30]:
check_rows(google)

The Google Play data also contains duplicate entries, which need to be removed. For example, Instagram appears several times:

In [31]:
for app in google:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


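We can use a dictionary to count the duplicates in both the App Store and Google Play data. Note that the function below keys on column 0, which is the app name in the Google Play data but the `id` in the App Store data.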

In [32]:
def cal_duplicate(data_list):
    dict_list = {}
    unique_app = 0
    duplicate_app = 0
    for row in data_list:
        app_name = row[0]
        if app_name in dict_list:
            dict_list[app_name] += 1
            duplicate_app += 1
        else:
            dict_list[app_name] = 1
            unique_app += 1
    print(duplicate_app, ' Duplicate Apps')
    print(unique_app, ' Unique Apps')
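
The same counts can be obtained with `collections.Counter` from the standard library. This is a sketch under assumptions: `count_duplicates` and its `name_index` parameter are hypothetical, added so that the name column can be chosen per data set (for example index 1, `track_name`, for the App Store rows).

```python
from collections import Counter

def count_duplicates(data_list, name_index=0):
    """Return (unique, duplicates) for the values in column name_index."""
    counts = Counter(row[name_index] for row in data_list)
    unique = len(counts)
    # Every occurrence beyond the first counts as a duplicate.
    duplicates = len(data_list) - unique
    return unique, duplicates

# Made-up rows for illustration.
sample = [['Instagram'], ['Instagram'], ['Example App'], ['Instagram']]
unique, duplicates = count_duplicates(sample)
print(unique, duplicates)  # 2 2
```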

In [33]:
cal_duplicate(apple)

0  Duplicate Apps
7197  Unique Apps


In [34]:
cal_duplicate(google)

1181  Duplicate Apps
9659  Unique Apps


We would like to remove the duplicate rows from the Google Play data. After examining the data, we found that the 4th column is the number of reviews, which also indicates how recent an entry is. So when we remove duplicated apps, we keep the latest entry (the one with the largest number of reviews). But first, we need to keep a reference to the original Google Play list.

In [35]:
google_orignal = google

To remove the duplicates, we first create a dictionary in which the key is the app name and the value is the maximum number of reviews. Then we remove the duplicates based on that dictionary.
Create the dictionary.

In [36]:
google_review_dict = {}

for row in google_orignal:
    app_name = row[0]
    review = float(row[3])
    if app_name not in google_review_dict:
        google_review_dict[app_name] = review
    elif review > google_review_dict[app_name]:
        google_review_dict[app_name] = review

print(len(google_review_dict))

9659


Remove the duplicates.

In [37]:
google = []
google_added = []

for row in google_orignal:
    app_name = row[0]
    review = float(row[3])
    if (app_name not in google_added) and (google_review_dict[app_name] == review):
        google.append(row)
        google_added.append(app_name)
    
print(len(google))

9659
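
The two passes above can also be collapsed into one by storing the best row per app directly. This is only a sketch: `keep_most_reviewed` is a hypothetical helper, the sample rows are shortened, and 'Example App' is made up. Unlike the notebook's version, the output order follows each app's first appearance rather than the position of its most-reviewed row.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the largest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when review counts are tied.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Example App', 'TOOLS', '4.0', '100'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
cleaned = keep_most_reviewed(sample)
print(len(cleaned))   # 2
print(cleaned[0][3])  # 66577446
```

A dictionary lookup also avoids the repeated `app_name not in google_added` scan over a list, which grows slower as the list gets longer.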


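We would like to remove non-English apps from the App Store data. First, build a list that flags whether each app has an English name. A name is treated as English if it contains at most one non-ASCII character (a character whose code point is above 127).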

In [38]:
check_english = []

# 0 is English app
# 1 is non-English app

for row in apple:
    name = str(row[1])
    temp = []
    for character in name:
        if ord(character) > 127:
            temp.append(1)
        else:
            temp.append(0)
    if sum(temp) <= 1:
        check_english.append(0)
        row.append(0)
    else:
        check_english.append(1)
        row.append(1)
        
print(len(check_english))
# print(check_english)

i = 0
for number in check_english:
    if number == 0:
        i += 1
print(i)

7197
6100


Remove non-English apps.

In [39]:
apple_orignal = apple
apple = []

for row in apple_orignal:
    if row[16] == 0:
        apple.append(row)
        
print(len(apple))
print(apple[:5])

6100
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1', 0], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1', 0], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1', 0], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1', 0], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1', 0]]


Find the free apps in both the App Store and Google Play data.

In [43]:
free_apple = []
free_google = []

for row in apple:
    price = float(row[4])
    if price == 0:
        free_apple.append(row)
        
for row in google:
    price = row[6]  # column 6 is 'Type'; free apps are marked 'Free'
    if price == 'Free':
        free_google.append(row)
        
print('Number of free apps in Apple Store: ', len(free_apple))
print('Number of free apps in Google Play: ', len(free_google))

print(free_apple[:5])
print(free_google[:5])

Number of free apps in Apple Store:  3169
Number of free apps in Google Play:  8904
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1', 0], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1', 0], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1', 0], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1', 0], ['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1', 0]]
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U L

Define a function called `freq_table` that builds a frequency table for one column: a dictionary mapping each value to the number of rows in which it appears.

In [48]:
def freq_table(list_table, n_col):
    col_list = []
    for row in list_table:
        col_list.append(row[n_col])
        
    col_dict = {}
    for n in col_list:
        if n in col_dict:
            col_dict[n] += 1
        else:
            col_dict[n] = 1
    return col_dict
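
Raw counts are hard to compare between two stores of different sizes, so a percentage version can be useful. The sketch below is an assumption-labeled illustration: `freq_table_pct` is a hypothetical variant and the sample rows are made up.

```python
def freq_table_pct(rows, col_index):
    """Return {value: percentage of rows} for the column at col_index."""
    counts = {}
    for row in rows:
        value = row[col_index]
        counts[value] = counts.get(value, 0) + 1
    total = len(rows)
    return {key: count / total * 100 for key, count in counts.items()}

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
pct = freq_table_pct(sample, 1)
print(pct)  # {'Games': 75.0, 'Music': 25.0}
```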

In [49]:
apple_prime_genre = freq_table(free_apple, 11)
print(apple_prime_genre)

{'Social Networking': 104, 'Photo & Video': 160, 'Games': 1855, 'Music': 65, 'Reference': 17, 'Health & Fitness': 63, 'Weather': 27, 'Utilities': 76, 'Travel': 36, 'Shopping': 80, 'News': 42, 'Navigation': 6, 'Lifestyle': 49, 'Entertainment': 248, 'Food & Drink': 26, 'Sports': 69, 'Book': 12, 'Finance': 35, 'Education': 118, 'Productivity': 54, 'Business': 17, 'Catalogs': 4, 'Medical': 6}


In [53]:
google_genre = freq_table(free_google, 9)
google_category = freq_table(free_google, 1)

print(google_genre)
print('\n')
print(google_category)

{'Art & Design': 54, 'Art & Design;Creativity': 6, 'Auto & Vehicles': 82, 'Beauty': 53, 'Books & Reference': 194, 'Business': 408, 'Comics': 55, 'Comics;Creativity': 1, 'Communication': 288, 'Dating': 165, 'Education': 480, 'Education;Creativity': 4, 'Education;Education': 31, 'Education;Pretend Play': 5, 'Education;Brain Games': 3, 'Entertainment': 542, 'Entertainment;Brain Games': 7, 'Entertainment;Creativity': 3, 'Entertainment;Music & Video': 15, 'Events': 63, 'Finance': 328, 'Food & Drink': 110, 'Health & Fitness': 273, 'House & Home': 73, 'Libraries & Demo': 83, 'Lifestyle': 349, 'Lifestyle;Pretend Play': 1, 'Card': 40, 'Arcade': 164, 'Puzzle': 100, 'Racing': 88, 'Sports': 307, 'Casual': 156, 'Simulation': 184, 'Adventure': 61, 'Trivia': 38, 'Action': 275, 'Word': 23, 'Role Playing': 83, 'Strategy': 81, 'Board': 34, 'Music': 18, 'Action;Action & Adventure': 9, 'Casual;Brain Games': 12, 'Educational;Creativity': 3, 'Puzzle;Brain Games': 15, 'Educational;Education': 35, 'Casual;Pre

Create a `display_table` function that prints the frequency table sorted in descending order of count.

In [58]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
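
Sorting `(count, key)` tuples in reverse also reverses the alphabetical order among ties. If ties should instead appear alphabetically, a sort key can express that directly. This is a sketch with a hypothetical helper `sorted_counts`; the counts are a few of the App Store values shown earlier.

```python
def sorted_counts(table):
    """Return (key, count) pairs, highest count first; ties in alphabetical order."""
    return sorted(table.items(), key=lambda item: (-item[1], item[0]))

table = {'Reference': 17, 'Games': 1855, 'Business': 17, 'Music': 65}
for key, count in sorted_counts(table):
    print(key, ':', count)
# Games : 1855
# Music : 65
# Business : 17
# Reference : 17
```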

In [59]:
display_table(free_apple, 11) # Display Apple Store Prime genre

Games : 1855
Entertainment : 248
Photo & Video : 160
Education : 118
Social Networking : 104
Shopping : 80
Utilities : 76
Sports : 69
Music : 65
Health & Fitness : 63
Productivity : 54
Lifestyle : 49
News : 42
Travel : 36
Finance : 35
Weather : 27
Food & Drink : 26
Reference : 17
Business : 17
Book : 12
Navigation : 6
Medical : 6
Catalogs : 4


In [60]:
display_table(free_google, 9) # Display Google Play genre

Tools : 750
Entertainment : 542
Education : 480
Business : 408
Lifestyle : 349
Productivity : 346
Finance : 328
Medical : 313
Sports : 307
Personalization : 295
Communication : 288
Action : 275
Health & Fitness : 273
Photography : 262
News & Magazines : 252
Social : 236
Travel & Local : 206
Shopping : 200
Books & Reference : 194
Simulation : 184
Dating : 165
Arcade : 164
Video Players & Editors : 158
Casual : 156
Maps & Navigation : 126
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 73
Weather : 71
Events : 63
Adventure : 61
Comics : 55
Art & Design : 54
Beauty : 53
Parenting : 44
Card : 40
Trivia : 38
Casino : 38
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 31
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

In [61]:
display_table(free_google, 1) # Display Google Play category

FAMILY : 1689
GAME : 864
TOOLS : 751
BUSINESS : 408
LIFESTYLE : 350
PRODUCTIVITY : 346
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 295
COMMUNICATION : 288
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 262
NEWS_AND_MAGAZINES : 252
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 200
BOOKS_AND_REFERENCE : 194
DATING : 165
VIDEO_PLAYERS : 160
MAPS_AND_NAVIGATION : 126
FOOD_AND_DRINK : 110
EDUCATION : 104
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 58
COMICS : 56
BEAUTY : 53


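Calculate the total installations per genre (for the App Store data, the total rating count is used in place of installs). Note that the output recorded after the next cell came from a run that accumulated the column index (5) instead of each app's rating count, so its totals are simply the app count times 5.

In [90]:
def cal_total_install(dataset, genre_index, install_number):
    # install_number is the index of the column holding the install (or rating) count
    genre_dict = freq_table(dataset, genre_index)
    install_dict = {}
    install_per_genre_dict = {}
    
    for row in dataset:
        genre_name = row[genre_index]
        install_number_per_app = float(row[install_number])
        
        # Accumulate the app's own count, not the column index
        if genre_name in install_dict:
            install_dict[genre_name] += install_number_per_app
        else:
            install_dict[genre_name] = install_number_per_app
        
    # Average count per app in each genre (computed here but not returned)
    for key in genre_dict:
        install_per_genre_dict[key] = float(install_dict[key]) / float(genre_dict[key])
        
    return genre_dict, install_dict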

In [91]:
apple_genre_dict, apple_install_dict = cal_total_install(free_apple, 11, 5)
print(apple_genre_dict)
print('\n')
print(apple_install_dict)

{'Social Networking': 104, 'Photo & Video': 160, 'Games': 1855, 'Music': 65, 'Reference': 17, 'Health & Fitness': 63, 'Weather': 27, 'Utilities': 76, 'Travel': 36, 'Shopping': 80, 'News': 42, 'Navigation': 6, 'Lifestyle': 49, 'Entertainment': 248, 'Food & Drink': 26, 'Sports': 69, 'Book': 12, 'Finance': 35, 'Education': 118, 'Productivity': 54, 'Business': 17, 'Catalogs': 4, 'Medical': 6}


{'Social Networking': 520, 'Photo & Video': 800, 'Games': 9275, 'Music': 325, 'Reference': 85, 'Health & Fitness': 315, 'Weather': 135, 'Utilities': 380, 'Travel': 180, 'Shopping': 400, 'News': 210, 'Navigation': 30, 'Lifestyle': 245, 'Entertainment': 1240, 'Food & Drink': 130, 'Sports': 345, 'Book': 60, 'Finance': 175, 'Education': 590, 'Productivity': 270, 'Business': 85, 'Catalogs': 20, 'Medical': 30}
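
For Google Play, the `Installs` column holds strings such as `'10,000+'`, which `float()` cannot parse directly. The sketch below shows one possible way to clean those strings and compute an average per genre. `parse_installs` and `average_per_genre` are hypothetical helpers, and the sample rows are made up apart from the install strings, which follow the format seen in the data above.

```python
def parse_installs(text):
    """Convert a string such as '10,000+' to a float."""
    return float(text.replace(',', '').replace('+', ''))

def average_per_genre(rows, genre_index, count_index, parse=float):
    """Return {genre: average of the parsed count column} over rows."""
    totals = {}
    counts = {}
    for row in rows:
        genre = row[genre_index]
        totals[genre] = totals.get(genre, 0.0) + parse(row[count_index])
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: totals[genre] / counts[genre] for genre in totals}

sample = [
    ['App A', 'ART_AND_DESIGN', '10,000+'],
    ['App B', 'ART_AND_DESIGN', '500,000+'],
    ['App C', 'TOOLS', '1,000+'],
]
averages = average_per_genre(sample, 1, 2, parse=parse_installs)
print(averages)  # {'ART_AND_DESIGN': 255000.0, 'TOOLS': 1000.0}
```

For the App Store rows the default `parse=float` would suffice, since the rating counts are plain numeric strings.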


Convert a dictionary to a list of [key, value] pairs.

In [107]:
def dict_to_list(dictionary):
    dict_list = []
    for key, value in dictionary.items():
        temp = [key, value]
        dict_list.append(temp)
    return dict_list
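
Once the dictionary is a list, it can be sorted by value. A minimal sketch, assuming a hypothetical helper `dict_to_sorted_list`; the counts are a few of the App Store values shown earlier.

```python
def dict_to_sorted_list(dictionary):
    """Return [key, value] pairs sorted by value, largest first."""
    pairs = [[key, value] for key, value in dictionary.items()]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs

table = {'Weather': 27, 'Games': 1855, 'Music': 65}
print(dict_to_sorted_list(table))
# [['Games', 1855], ['Music', 65], ['Weather', 27]]
```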