**Profitable App Profiles for the App Store and Google Play Markets**

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We analyze the data to help our developers understand what kinds of apps are likely to attract more users.

This notebook analyzes two data sets:


A data set containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018.

A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017.

In [1]:
from csv import reader

In [2]:
### The Google Play data set ###
# encoding='utf8' is set explicitly so non-ASCII app names are read the same
# way on every platform; the 'with' block closes the file once it is read.
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### The App Store data set ###
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # '\n' plus print's own line ending leaves two blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
#Display android column names
android_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [5]:
#Exploring the first 3 rows of android and its shape
explore_data(android, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The Google Play data set contains 10,841 rows and 13 columns. The columns that might be useful for the purpose of our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

In [6]:
#Display App store column names
ios_header

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [7]:
#Exploring the first 3 rows of App Store and its shape
explore_data(ios, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The App Store dataset contains 7197 rows and 16 columns. The columns that might be useful for the purpose of our analysis are 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'.

**Data Cleaning**

The Google Play data set has a dedicated discussion section, and one of the discussions reports an error in row 10472. Let's print this row and compare it against the header and the correct rows printed earlier.

In [8]:
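print(android_header)
print('\n')
print(android[10472]) # 12 fields instead of 13: the Category value is missing
# Delete the row only while it is still malformed, so that re-running this
# cell does not remove a valid row.
if len(android[10472]) != len(android_header):
    del android[10472]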

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
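
A row like this can also be found without relying on the discussion thread, by comparing each row's length against the header's length. The following is a minimal sketch of that check; the helper name `find_malformed_rows` and the sample rows are hypothetical and are not part of the original data set.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row_length) pairs for rows whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


# Tiny made-up sample, for illustration only.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '3.9'],            # 'Category' is missing, so the row is one field short
    ['App C', 'GAME', '4.5'],
]

malformed = find_malformed_rows(sample_rows, sample_header)
print(malformed)
```

On the real data, the same helper could be called as `find_malformed_rows(android, android_header)` before any row is deleted.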


**Removing Duplicate Entries**

According to the discussion section, some apps appear more than once in the data set. Upon inspection, the duplicate rows for an app appear to differ in their number of reviews. We will first locate the duplicate apps and then keep only the row with the highest number of reviews.

In [9]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:10])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


To remove duplicate entries, we will:

-Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app


-Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

In [10]:
reviews_max = {} #empty dictionary for storing max number of reviews per app

In [11]:
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659


Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:

    We start by initializing two empty lists, android_clean and already_added.

    We loop through the android data set, and for every iteration:
    
    We isolate the name of the app and the number of reviews.
    
    We add the current row (app) to the android_clean list, and the app name (name) to the already_added list if:
    
        -The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max dictionary; and
        
        -The name of the app is not already in the already_added list. We need to add this supplementary condition to account for those cases where the highest number of reviews of a duplicate app is the same for more than one entry (for example, the Box app has three entries, and the number of reviews is the same). If we just check for reviews_max[name] == n_reviews, we'll still end up with duplicate entries for some apps



In [12]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
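
The two passes above can also be collapsed into one by storing the whole row, rather than only the review count, for each app name. This is a minimal sketch under that assumption; the function name `dedupe_by_max_reviews` and the sample rows are hypothetical, and the column indices are passed in rather than hard-coded.

```python
def dedupe_by_max_reviews(dataset, name_index, reviews_index):
    """Keep one row per app name: the first row seen with the highest review count."""
    best_rows = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())


# Made-up rows in (name, reviews) form, for illustration only.
sample = [
    ['Box', '100'],
    ['Slack', '50'],
    ['Box', '100'],   # tie: the first 'Box' row is kept
    ['Slack', '75'],  # more reviews: replaces the earlier 'Slack' row
]

deduped = dedupe_by_max_reviews(sample, name_index=0, reviews_index=1)
print(deduped)
```

Because a dictionary keeps keys in first-insertion order, the rows come back in the order each app name first appears, which may differ from the row order produced by the two-pass version.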
        

**Removing Non-English Apps**

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.
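
To make the range check concrete, the sketch below prints the number that the built-in `ord()` returns for a few characters, together with whether it falls within 0 to 127. The helper name `is_common_english_char` and the sample characters are chosen here purely for illustration.

```python
def is_common_english_char(character):
    """True when the character's code point falls in the ASCII range 0-127."""
    return ord(character) <= 127


# A mix of letters, a digit, a symbol, and characters outside the ASCII range.
for ch in ['a', 'Z', '5', '+', '爱', '™', '😜']:
    print(repr(ch), ord(ch), is_common_english_char(ch))
```

Symbols such as ™ and emoji fall outside the range even though they can appear in English app names, so a check on single characters alone would be too strict.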

In [13]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    # Decide only after every character has been counted; returning inside
    # the loop would classify the name by its first character alone.
    return non_ascii <= 3

To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

In [14]:
android_clean_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_clean_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

In [15]:
print(len(android_clean_english)) #number of rows in the English Google Play data set
print(len(ios_english)) #number of rows in the English App Store data set

9614
6183


**Identifying Free Apps**

In [17]:
f_android_clean_english = []
f_ios_clean_english = []

for app in android_clean_english:
    # index 7 is the 'Price' column; free apps have the price '0'
    if app[7] == 'Free' or app[7] == '0':
        f_android_clean_english.append(app)

for app in ios_english:
    # index 4 is the 'price' column; free apps have the price '0.0'
    if app[4] == 'Free' or app[4] == '0.0':
        f_ios_clean_english.append(app)
        
print(len(f_android_clean_english)) #Final Google Play data set length
print(len(f_ios_clean_english)) #Final iOS data set length

8864
3222


Because our end goal is to publish the app on both the App Store and Google Play, we need to find app profiles that are successful in both markets. For instance, a profile that might work well for both markets is a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.
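
One possible way to build such a frequency table is with `collections.Counter` from the standard library. This is only a sketch: the names `percentage_table` and `sorted_percentages` and the sample rows are hypothetical, and the function developed in this notebook may be written differently.

```python
from collections import Counter


def percentage_table(dataset, index):
    """Map each distinct value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    if total == 0:
        return {}  # an empty data set has no shares to report
    return {value: count / total * 100 for value, count in counts.items()}


def sorted_percentages(dataset, index):
    """Return (value, percentage) pairs, most frequent first."""
    table = percentage_table(dataset, index)
    return sorted(table.items(), key=lambda pair: pair[1], reverse=True)


# Made-up rows with the genre in column 0, for illustration only.
sample = [['Games'], ['Games'], ['Games'], ['Education']]
ranked = sorted_percentages(sample, 0)
print(ranked)
```

Expressing the counts as percentages makes the two markets comparable even though the data sets differ in size.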



In [None]:
def freq_table(dataset, index):
    