# Profitable App Profiles for the App Store and Google Play Markets

Our goal for this project is to build a data-driven analysis that helps the company's developers, who build Android and iOS mobile apps, make profitable decisions by understanding which types of apps are likely to attract more users.

At the company, we only build apps that are free to download and install, and our main source of revenue is in-app ads. The number of users of our apps therefore has a large effect on our revenue.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over four million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we should first check whether any relevant existing data is available at no cost. Luckily, there are two data sets that seem suitable for our purpose:

* A data set containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from this link.
* A data set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this link.

Let's start by opening the two data sets and then continue with exploring the data.

In [1]:
from csv import reader

def load_csv(path):
    # Read a CSV file and return (header, rows).
    # The with-block closes the file; utf8 is assumed as the file encoding.
    with open(path, encoding='utf8') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]

# The App Store dataset
apple_header, apple_dataset = load_csv('AppleStore.csv')

# The Google Play store dataset
google_header, google_dataset = load_csv('googleplaystore.csv')

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


print(apple_header)        
print("\n")
explore_data(apple_dataset,0,5, True)


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


The App Store dataset consists of 7,197 rows (excluding the header) and 16 columns. For the analysis, we will focus on `track_name`, `price`, `size_bytes`, `rating_count_tot`, `user_rating`, `cont_rating`, and `prime_genre`.

In [2]:
print(google_header)        
print("\n")
explore_data(google_dataset,0,5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

The Google Play dataset consists of 10,841 rows (excluding the header) and 13 columns. For the analysis, we will focus on `App`, `Price`, `Reviews`, `Size`, `Installs`, `Content Rating`, and `Genres`.

**Deleting Wrong Data**

The Google Play dataset has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error for a certain row. 

In [3]:
print(google_header) 
print('\n')
print(google_dataset[10472])
print('\n')
print('Correct Row:','\n', google_dataset[0])
      

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Correct Row: 
 ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


The row at index 10472 contains 12 elements, whereas a correct row has 13. Comparing it with the header shows that the `Category` value is missing, so every later value is shifted one column to the left. We are therefore going to delete this row.
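The same length check can be applied to every row rather than to one known index. The sketch below is a minimal illustration on made-up rows; `find_malformed_rows` is a hypothetical helper name, not part of the original notebook.

```python
def find_malformed_rows(header, dataset):
    """Return (index, row length) pairs for rows whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data for illustration only (not taken from the real data set)
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '4.3'],            # 'Category' is missing, so the row is one element short
    ['App C', 'GAME', '3.9'],
]

print(find_malformed_rows(toy_header, toy_rows))  # [(1, 2)]
```

Calling it as `find_malformed_rows(google_header, google_dataset)` before any deletion should list every row that is shorter or longer than the header.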

In [4]:
print(len(google_dataset))
del google_dataset[10472]  # don't run this more than once
print(len(google_dataset))

10841
10840


**Removing Duplicate Entries**

**Part One**

Some apps have duplicate entries. We need to remove the duplicates and keep only one entry per app. For instance, Instagram has four entries:



In [5]:
for app in google_dataset:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [6]:
google_unique_app = []
google_duplicate_app = []
for row in google_dataset:
    app_name = row[0]
    if app_name in google_unique_app:
        google_duplicate_app.append(app_name)
    else:
        google_unique_app.append(app_name)
print('Numbers of duplicated apps: ', len(google_duplicate_app))
print('\n')
print('Example of duplicate apps: ', google_duplicate_app[:10])

Numbers of duplicated apps:  1181


Example of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
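Checking `app_name in google_unique_app` rescans a list on every row, which gets slow as the data grows. As a hedged alternative, the sketch below counts names in a single pass with `collections.Counter`. The helper name `count_duplicates` and the toy names are assumptions made for illustration.

```python
from collections import Counter

def count_duplicates(names):
    """Return (number of extra entries, names that occur more than once)."""
    counts = Counter(names)
    # Every occurrence beyond the first counts as one duplicate entry
    extra = sum(count - 1 for count in counts.values())
    repeated = [name for name, count in counts.items() if count > 1]
    return extra, repeated

# Toy names for illustration only
toy_names = ['Box', 'Slack', 'Box', 'Zoom', 'Box', 'Slack']
extra, repeated = count_duplicates(toy_names)
print(extra)     # 3
print(repeated)  # ['Box', 'Slack']
```

The first value counts duplicates the same way as the loop above: each repeated occurrence adds one.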


**Part Two**

If we examine the rows we printed for the Instagram app, the main difference is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We therefore want to keep the row with the highest number of reviews, since a higher count suggests the data is more recent.

To remove the duplicates, we will do the following:
* Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.

* Use the information stored in the dictionary and create a new dataset, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [7]:
reviews_max = {}
for row in google_dataset:
    name = row[0]
    reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < reviews:
        reviews_max[name] = reviews
    elif name not in reviews_max:
        reviews_max[name] = reviews

            

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should equal the length of our data set minus 1,181, that is, 10,840 - 1,181 = 9,659.

In [8]:
print('Actual length: ', len(reviews_max))

Actual length:  9659


Now, let's use the `reviews_max` dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:

* We start by initializing two empty lists, `google_clean` and `already_added`.
* We loop through the google data set, and for every iteration:
    * We isolate the name of the app and the number of reviews.
    * We add the current row (`row`) to the `google_clean` list, and the app name (`name`) to the `already_added` list if:
        * The number of reviews of the current app matches the number of reviews of that app as described in the `reviews_max` dictionary; and
        * The name of the app is not already in the `already_added` list. We need this supplementary condition to account for cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for `reviews_max[name] == reviews`, we'll still end up with duplicate entries for some apps.

In [9]:
google_clean = [] #store new dataset
already_added = [] #store app names

for row in google_dataset:
    name = row[0]
    reviews = float(row[3])
    if reviews == reviews_max[name] and name not in already_added:
        google_clean.append(row)
        already_added.append(name)

explore_data(google_clean,0,5,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns: 13
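To see why the "not already added" condition matters, the sketch below runs both variants on a few made-up rows in which one app has tied review counts. The rows and the names `toy_max`, `reviews_only`, and `deduplicated` are assumptions for illustration. A set is used here in place of the `already_added` list, which changes lookup speed but not the result.

```python
# Toy rows: [name, category, rating, reviews]; values are made up for illustration
toy_rows = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]

# Highest review count per app
toy_max = {}
for row in toy_rows:
    name, reviews = row[0], float(row[3])
    toy_max[name] = max(toy_max.get(name, 0.0), reviews)

# Variant 1: compare review counts only; tied rows all pass the check
reviews_only = [row for row in toy_rows if float(row[3]) == toy_max[row[0]]]

# Variant 2: also require that the name has not been added yet
seen = set()
deduplicated = []
for row in toy_rows:
    name, reviews = row[0], float(row[3])
    if reviews == toy_max[name] and name not in seen:
        deduplicated.append(row)
        seen.add(name)

print(len(reviews_only))  # 4 (three tied 'Box' rows plus one 'Slack' row)
print(len(deduplicated))  # 2 (one row per app)
```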


**Removing Non-English Apps**

**Part One**

Remember we use English for the apps we develop at our company, and we'd like to analyze only the apps that are designed for an English-speaking audience. However, if we explore the data long enough, we'll find that both datasets have apps with names that suggest they are not designed for an English-speaking audience.

Each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for character 'a' is 97, character 'A' is 65, and character '爱' is 29,233. We can get the corresponding number of each character using the `ord()` built-in function.
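As a quick check of the numbers quoted above, the following snippet prints the code point of each character using the built-in `ord()`. The variable name `code_points` is only illustrative.

```python
# Map each sample character to its code point
code_points = {character: ord(character) for character in ['a', 'A', '爱']}

for character, number in code_points.items():
    print(character, number)
# a 97
# A 65
# 爱 29233
```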

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.

To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range.

In [10]:
def eng_chr(string):
    non_eng_chr = 0
    for character in string:
        if ord(character) > 127:
            non_eng_chr += 1
    if non_eng_chr > 3:
        return False
    else:
        return True

print(eng_chr("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(eng_chr('Instagram'))
print(eng_chr('Docs To Go™ Free Office Suite'))
print(eng_chr('Instachat 😜'))

False
True
True
True
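The threshold of three is easier to judge when the number of non-ASCII characters per name is visible. The sketch below counts them for the same four sample names; `count_non_ascii` is a hypothetical helper, not part of the original notebook.

```python
def count_non_ascii(string):
    """Count the characters whose code point falls outside the ASCII range (0 to 127)."""
    return sum(1 for character in string if ord(character) > 127)

samples = [
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]

for name in samples:
    print(count_non_ascii(name), name)
```

Names with a single trademark sign or emoji contain only one non-ASCII character and stay well under the threshold, while the first name exceeds it.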


In [11]:
google_english = []
apple_english = []

for row in google_clean:
    name = row[0]
    if eng_chr(name):
        google_english.append(row)

for row in apple_dataset:
    name = row[1]  # track_name is at index 1 (index 2 is size_bytes)
    if eng_chr(name):
        apple_english.append(row)

explore_data(google_english, 0, 3, True)
explore_data(apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'G

**Isolating Free Apps**

In this project, we are only building apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, so we keep only the free ones for our analysis.

In [12]:
google_free_dataset = [row for row in google_english if row[7] == '0']
apple_free_dataset = [row for row in apple_english if row[4] == '0.0']
google_dataset = google_free_dataset
apple_dataset = apple_free_dataset
explore_data(google_dataset,0,6,True)
explore_data(apple_dataset,0,6,True)

        

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+
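Comparing the price to the exact strings `'0'` and `'0.0'` works only as long as every free app is written in one of those two forms. As a hedged alternative, the sketch below parses the price as a number first. It assumes that paid prices may carry a leading `$` (for example `'$4.99'`); `is_free` is a hypothetical helper name.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0', or '$0.00' represents zero cost."""
    cleaned = price.strip().lstrip('$')  # assumption: paid prices may start with '$'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # prices that cannot be parsed are treated as not free

for price in ['0', '0.0', '$4.99', '']:
    print(repr(price), is_free(price))
# '0' True
# '0.0' True
# '$4.99' False
# '' False
```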

**Most Common Apps by Genre** 

**Part One**

So far in the data cleaning process, we've done the following:
* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps
* Isolated the free apps

As mentioned, our goal is to determine the kinds of apps that are likely to attract more users since our revenue is tied directly to the number of people using our apps.

To find the kinds of apps that appeal most to users, we need to look for app profiles that are successful on both the Google Play and App Store markets. To keep risk and overhead low, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app and add it to Google Play, because that market has relatively more data and users available.

2. If the app gets a good response from users, develop it further. Once an app reaches a certain level of popularity, users will expect a better-developed version to be released.

3. If the app is profitable after six months, build an iOS version and add it to the App Store to expand our market, attract users on the iOS platform, and ultimately generate more revenue.



In [None]:
# Column indices relevant to the genre analysis:
# google row[1]  -> 'Category'
# google row[9]  -> 'Genres'
# apple row[11]  -> 'prime_genre'
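One way to find the most common genres from these columns is a frequency table that maps each value in a column to its share of rows. The sketch below is a minimal illustration on made-up rows, not the notebook's own implementation. The name `freq_table` and the toy data are assumptions.

```python
def freq_table(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

# Toy rows for illustration only: [name, category]
toy_rows = [['App A', 'GAME'], ['App B', 'GAME'], ['App C', 'TOOLS'], ['App D', 'GAME']]

table = freq_table(toy_rows, 1)
# Print the values from most to least common
for value, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(value, percentage)
# GAME 75.0
# TOOLS 25.0
```

On the cleaned data, the same function could be called with index 1 or 9 for Google Play rows and index 11 for App Store rows.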
