# Profitable App Profiles for the App Store and Google Play Markets

## Step 1 - Analyzing Mobile App data

Our goal for this project is to analyze data to help our developers understand which types of apps are likely to attract more users.

## Step 2 - Opening and Exploring the Data

In this stage, I explore the data provided. The code below first reads each data set into a list of rows, separating the header row from the data rows, and then defines the explore_data function to preview a slice of a data set.

In [60]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv', encoding='utf8')  # app names contain non-ASCII characters
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv', encoding='utf8')  # app names contain non-ASCII characters
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
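
The loading steps above are repeated for each file, so they could be wrapped in a small helper. The sketch below is only one way to do it: load_csv is a hypothetical name, and the utf8 encoding is an assumption about how the files are stored. The with block closes the file automatically once the rows have been read.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`."""
    # The context manager closes the file even if reading fails.
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    # The first row is the header, the rest are data rows.
    return rows[0], rows[1:]

# Possible usage with the files from this project:
# android_header, android = load_csv('googleplaystore.csv')
# ios_header, ios = load_csv('AppleStore.csv')
```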


Here is a preview of the Android data, including the header row:

In [61]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Here is a preview of the iOS data (the second row), including the header row:

In [62]:
print(ios_header)
print('\n')
explore_data(ios, 1, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


Our goal is to understand which types of apps attract more users, so the useful columns in each data set are:

Android:
- App 
- Category
- Installs
- Genres
- Reviews
- Price

iOS:
- prime_genre
- track_name
- price
- rating_count_tot
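
Since the rows are plain lists, each of these columns has to be read by position. As a small sketch, the positions can be looked up from the header rows instead of being counted by hand. The column_indexes helper is a hypothetical name, and the headers are copied from the previews above.

```python
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

def column_indexes(header, names):
    """Map each column name to its position in the header row."""
    return {name: header.index(name) for name in names}

android_cols = column_indexes(
    android_header, ['App', 'Category', 'Installs', 'Genres', 'Reviews', 'Price'])
ios_cols = column_indexes(
    ios_header, ['prime_genre', 'track_name', 'price', 'rating_count_tot'])

print(android_cols)
print(ios_cols)
```

With this lookup, the Android price is at index 7 and the iOS price is at index 4.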

## Step 3 - Deleting wrong data

In this step, we clean the data. We are only interested in apps that are:
- Free
- For an English Speaking Audience

A discussion of the Google Play data set also reports an error in one of its rows.

Let's print the incorrect row and compare it to the header and a correct row:

In [63]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


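The row for 'Life Made WI-Fi Touchscreen Photo Frame' has only 12 values instead of 13: the Category value is missing, so every later value is shifted one column to the left. As a result, the Rating column holds 19, but the maximum possible rating is 5.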
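
Because the faulty row is one value short, rows like it can be found by comparing each row's length to the header's length. This is a minimal sketch on the two rows printed above; find_malformed_rows is a hypothetical helper, and it only catches rows with a missing or extra value, not other kinds of errors.

```python
header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']

sample_rows = [
    # A well-formed row: 13 values, one per column.
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1',
     '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design',
     'January 7, 2018', '1.0.0', '4.0.3 and up'],
    # The faulty row: 12 values, because the category is missing.
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]

def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    return [index for index, row in enumerate(dataset)
            if len(row) != len(header)]

print(find_malformed_rows(sample_rows, header))  # prints [1]
```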

In [64]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))

10841
10840
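
Deleting by index removes a different row each time the cell is re-run. A safer variant, sketched here on toy data, rebuilds the list from the rows that have the expected number of values, so running it twice changes nothing. The drop_malformed_rows helper is hypothetical, and it assumes faulty rows can be recognised by their length.

```python
def drop_malformed_rows(dataset, header):
    """Keep only the rows that have one value per header column."""
    return [row for row in dataset if len(row) == len(header)]

# Toy data: the second row is missing a value.
header = ['App', 'Category', 'Rating']
rows = [['Demo app', 'TOOLS', '4.1'], ['Broken app', '19']]

rows = drop_malformed_rows(rows, header)
rows = drop_malformed_rows(rows, header)  # a second run changes nothing
print(len(rows))  # prints 1
```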


## Step 4 - Removing duplicate entries (part one)

Some apps have duplicate entries, for example Instagram:

In [65]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, there are 1181 cases where an app occurs more than once:

In [66]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
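
Checking membership in a list rescans the list for every app. As an alternative sketch, collections.Counter from the standard library counts every name in one pass. The names below are toy data, not the real data set.

```python
from collections import Counter

# Toy list of app names; 'Instagram' appears four times and 'Box' twice.
names = ['Instagram', 'Box', 'Instagram', 'Slack', 'Box', 'Instagram', 'Instagram']

counts = Counter(names)

# Every occurrence after the first one counts as a duplicate entry.
extra_entries = sum(count - 1 for count in counts.values())
repeated_names = [name for name, count in counts.items() if count > 1]

print('Number of duplicate entries:', extra_entries)  # prints 4
print('Apps with duplicates:', repeated_names)
```

This counts duplicates the same way as the loop above: each repeat after the first occurrence adds one.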


Next, we will remove the duplicates using the following criterion:
- If an app has multiple entries, keep the entry with the highest number of reviews and remove the others.

The reasoning is that the entry with the highest number of reviews should be the most recently collected data for that app.

## Step 5 - Removing duplicate entries (part two)

To remove the duplicates, we will do the following:
- Create a dictionary where each dictionary key is a unique app name and the value is the highest number of reviews for that app.
- Use the information in the dictionary to create a new dataset which will only have one entry per app

In [67]:
reviews_max = {}

for app in android:
    name = app[0]
    number_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < number_reviews:
        reviews_max[name] = number_reviews
    elif name not in reviews_max:
        reviews_max[name] = number_reviews

print('Expected Length: ', len(android)-1181)
print('Actual Length: ', len(reviews_max))

Expected Length:  9659
Actual Length:  9659


The loop above works as expected: the dictionary has one key per unique app. Next, let's use it to remove the duplicate rows.

The code below sets up two lists: android_clean (used later) and already_added, which keeps track of the apps added to android_clean.

The for loop goes through the apps in the android list:
- It takes the name of the app and the number of reviews.
- If the number of reviews matches the reviews_max value for that app (i.e. it's the entry with the most reviews) and the name is not already in already_added, the row is added to android_clean.
- The name is then also added to already_added, so an app whose highest review count appears in more than one row is not added twice.

In [68]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    number_reviews = float(app[3])
    if (reviews_max[name] == number_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


The number of rows matches the expected number, as does the number of columns, so we're ready to proceed to the next step of data cleaning: checking for non-English apps.
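
The two passes used above (build reviews_max, then filter) could also be folded into one pass by storing the best row per app directly in a dictionary. This is only a sketch: deduplicate is a hypothetical helper, the rows are shortened toy data, and it assumes the Reviews column holds whole numbers, which is only true once the faulty row has been removed. The output order can also differ from android_clean, because each app keeps the position of its first occurrence.

```python
def deduplicate(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the most reviews."""
    best_rows = {}
    for row in dataset:
        name = row[name_index]
        reviews = int(row[reviews_index])
        # Store the row if the app is new or this row has more reviews.
        if name not in best_rows or reviews > int(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

# Toy rows with only the first four columns.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

for row in deduplicate(sample):
    print(row)
```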

## Step 6 - Removing Non-English Apps (part one)

Some of the apps are not aimed at an English-speaking audience. Here are some examples:

In [69]:
print(ios[813][1])
print(ios[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


To remove non-English apps, we will check for app names that contain non-ASCII characters. ASCII characters have code points in the 0-127 range.

Here's the code to check the code point of a character:

In [70]:
# The letter 'a' has a code point of 97 (the same as its ASCII value)
print(ord('a'))
# The character '艺' has a Unicode code point of 33402, outside the ASCII range
print(ord('艺'))

97
33402


So, if a name contains a character with a code point greater than 127, we can assume it's a non-English name. Let's make a function to check for this:

In [71]:
def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


## Step 7 - Removing Non-English Apps (part two)

Unfortunately, this initial approach does not work, because special characters like ™ and emojis fall outside the ASCII range even though they appear in English app names. So, we will update the function to flag a name as non-English only if it contains more than 3 non-ASCII characters. It's not a perfect approach, but it should work well enough.

In [72]:
def is_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True
            
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


This new function appears to work for our examples. Let's try it on the real data sets.

In [73]:
print('Android length before cleaning: ', len(android_clean))
print('iOS length before cleaning: ', len(ios)) 

android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        

for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

print('\n')
print('Android length after cleaning: ', len(android_english))
print('iOS length after cleaning: ', len(ios_english))

Android length before cleaning:  9659
iOS length before cleaning:  7197


Android length after cleaning:  9614
iOS length after cleaning:  6183
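
The cut-off of 3 non-ASCII characters is a judgment call, so it can help to make it a parameter and test names right at the boundary. The version below is a sketch of that idea. The max_non_ascii parameter is an addition for illustration, and the boundary names are made up.

```python
def is_english(string, max_non_ascii=3):
    """Return True if the string has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Docs To Go™ Free Office Suite'))    # True: one non-ASCII character
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))   # False: far more than three
print(is_english('Demo™™™'))                          # True: exactly three is allowed
print(is_english('Demo™™™™'))                         # False: four is over the limit
print(is_english('Instachat 😜', max_non_ascii=0))    # False: strict ASCII-only check
```

With max_non_ascii=0 the function behaves like the first version from part one.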


## Step 8 - Isolating the free apps

So far we have:
- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps

The final stage to prepare the data for our analysis is to isolate the free apps. 

For Android apps, the price is stored at index 7 (the Price column).
For iOS apps, the price is stored at index 4 (the price column).

In [None]:
print('Android length before cleaning: ', len(android_english))
print('iOS length before cleaning: ', len(ios_english))

android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)

print('\n')
print('Android length after cleaning: ', len(android_final))
print('iOS length after cleaning: ', len(ios_final))

Android length before cleaning:  9614
iOS length before cleaning:  6183


Android length after cleaning:  8864
iOS length after cleaning:  3222
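
The two loops compare the price against a different string for each store ('0' and '0.0'). As a sketch, a single check could convert the price to a number first. The is_free helper is hypothetical, and it assumes that paid prices look like '$4.99' or '1.99'. The previews above only show free apps, so that format is not confirmed here.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0' or '$0.00' means free."""
    # Assumption: the only non-numeric character in a price is a leading '$'.
    return float(price.replace('$', '').strip()) == 0.0

print(is_free('0'))      # True  (Google Play format for free apps)
print(is_free('0.0'))    # True  (App Store format for free apps)
print(is_free('$4.99'))  # False
print(is_free('1.99'))   # False
```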
