   **Profitable App Profiles for the App Store and Google Play Markets**

The goal of this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We are working as data analysts for a company that builds Android and iOS mobile apps, and we want to help its developers understand which types of apps are likely to attract more users.

The company only builds apps that are free to download and install, so its main source of revenue is in-app ads.

In [16]:
from csv import reader

# Open each file with an explicit encoding (UTF-8 assumed) and close it once it has been read
with open('AppleStore.csv', encoding='utf8') as opened_AppleStore:
    apps_AS = list(reader(opened_AppleStore))
with open('googleplaystore.csv', encoding='utf8') as opened_GooglePlayStore:
    apps_GPS = list(reader(opened_GooglePlayStore))

# Separate the header row from the data rows
ios_header = apps_AS[0]
ios = apps_AS[1:]
android_header = apps_GPS[0]
android = apps_GPS[1:]




In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(apps_AS,0,4)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']




In [4]:
explore_data(apps_GPS,0,4)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']




In [5]:
#number of rows and columns for AppleStore
print('Number of rows:', len(apps_AS))
print('Number of columns:', len(apps_AS[0]))

Number of rows: 7198
Number of columns: 16


In [6]:
#number of rows and columns for GooglePlayStore
print('Number of rows:', len(apps_GPS))
print('Number of columns:', len(apps_GPS[0]))

Number of rows: 10842
Number of columns: 13


In [7]:
#print column names to identify which ones can help with our analysis
print(apps_AS[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


The columns that seem interesting are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. Not all column names are descriptive enough. Here is a link to the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

In [19]:
#check the row with an error: it has only 12 values because 'Category' is missing, so every later column is shifted left
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [20]:
#delete the row (run this cell only once, otherwise a valid row is removed as well)
del android[10472]
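
Rows like this one can also be located without knowing the index in advance, by comparing each row's length with the header's length. The following is a minimal sketch on a made-up three-row sample; `find_malformed_rows` is a hypothetical helper and is not part of the original analysis.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up sample for illustration only: the second row is missing one value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

malformed = find_malformed_rows(sample_rows, sample_header)
for index, row in malformed:
    print(index, row)
```

On the real data, the call would presumably be `find_malformed_rows(android, android_header)`.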

In [23]:
#When looking closely at the data we observe some duplicates like for Instagram:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [25]:
#Maybe Instagram is not the only duplicate in android data so let's count the number of duplicate rows:
duplicate=[]
unique_apps=[]

for app in android: 
    name=app[0]
    if name in unique_apps:
        duplicate.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps', len(duplicate))
    

Number of duplicate apps 1181


There are 1181 duplicates in the Android data. For each duplicated app, we will keep the row with the highest number of reviews, since a higher count suggests that the row was collected more recently.

In [35]:
reviews_max = {}
for app in android:  
    name=app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name]=n_reviews
        
print(len(reviews_max))

9659


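In the next step, we remove the duplicates using two lists: android_clean, which stores the rows we keep, and already_added, which tracks the app names already stored. While looping over the Android data set, we keep a row only if its review count equals the maximum recorded for that app and its name has not been added yet. The second condition matters because duplicate rows can share the same maximum number of reviews, and without it all of them would be kept.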

In [112]:
android_clean=[]
already_added=[]
for app in android:
    name=app[0]
    n_reviews=float(app[3])
    if (n_reviews==reviews_max[name]) and (name not in already_added) :
        android_clean.append(app)
        already_added.append(name)
        
print(len(android_clean))

9659
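
The two passes above can also be folded into a single pass that stores the best row per name directly. This is only a sketch: `keep_most_reviewed` is a hypothetical helper, the sample rows are shortened to four columns (the Instagram review counts are taken from the output above, 'Other App' is made up), and the resulting row order follows the first appearance of each name, which may differ from the order in android_clean.

```python
def keep_most_reviewed(dataset, name_index, reviews_index):
    """Keep one row per app name: the first row holding the highest review count."""
    best = {}  # name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means ties keep the earlier row, so each name is stored once.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Shortened sample rows: name, category, rating, reviews.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Other App', 'TOOLS', '4.0', '10'],
]

deduplicated = keep_most_reviewed(sample, 0, 3)
print(len(deduplicated))
```

Because membership tests on a dictionary do not scan a list, this version also avoids the repeated `name not in already_added` lookups.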


We now write a function that checks whether a string is English. Since we are targeting an English-speaking audience, we want to remove apps whose names are not in English.

In [93]:
def english1(a_string):
    for character in a_string:
        if ord(character) > 127:
            return False
    
    return True

#let's check that our function works with some examples
print(english1('Instagram'))
print(english1('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english1('Docs To Go™ Free Office Suite'))
print(english1('Instachat 😜'))

True
False
False
False


This runs into a problem: the function labels some obviously English apps as non-English because their names contain special characters such as emoji or the ™ symbol. We therefore make the function less strict and label an app as non-English only if its name contains more than three characters that fall outside the ASCII range (code points above 127).

In [97]:
def is_english(a_string):
    non_english_character=0
    for character in a_string:
        if ord(character) > 127:
            non_english_character+=1
            
    # English if at most three characters fall outside the ASCII range
    return non_english_character <= 3

#let's check that our new function works using previous examples
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
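
To see why the threshold of three separates these examples, it can help to count the non-ASCII characters in each name. The sketch below uses a hypothetical helper, `count_non_ascii`, which is not part of the original notebook.

```python
def count_non_ascii(a_string):
    """Count the characters whose code point lies outside the ASCII range (0-127)."""
    return sum(1 for character in a_string if ord(character) > 127)


names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]

counts = {name: count_non_ascii(name) for name in names}
for name, count in counts.items():
    print(name, '->', count)
```

The ™ sign and the emoji each contribute a single non-ASCII character, so those names stay well under the limit, while the Chinese title exceeds it.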


In [114]:
android_english_app=[]
android_non_english_app=[]

for app in android_clean:
    name=app[0]
    if is_english(name):
        android_english_app.append(app)
    else:
        android_non_english_app.append(app)

ios_english_app=[]
ios_non_english_app=[]
for app in ios:
    name=app[1]
    if is_english(name):
        ios_english_app.append(app)
    else:
        ios_non_english_app.append(app)
        
print(len(android_english_app))
print(len(android_non_english_app))

explore_data(android_english_app, 0, 3, True)
print('\n')
explore_data(ios_english_app, 0, 3, True)


9614
45
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']

We have 9614 English apps in the Android data set and 6183 English apps in the iOS data set. As the last data cleaning step, we isolate the free apps.

In [140]:
android_final=[]
for app in android_english_app:
    app_type=app[6]   # the 'Type' column (index 6), not 'Price' (index 7)
    if app_type == 'Free':
        android_final.append(app)

print(len(android_final))

ios_final=[]
for app in ios_english_app:
    price=app[4]
    if price=='0.0':
        ios_final.append(app)

print(len(ios_final))

8863
3222
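
Comparing against the exact strings 'Free' and '0.0' works for these two files, but it depends on how each store formats its values. A more defensive variant converts the price to a number first. This is a sketch under assumptions: `is_free` is a hypothetical helper, and the '$' prefix on paid prices is an assumption about the Google Play 'Price' column that the rows shown above do not confirm.

```python
def is_free(price_text):
    """Return True when a price string represents zero, e.g. '0', '0.0' or '$0.00'."""
    # Assumption: paid prices may carry a leading '$', which float() cannot parse.
    cleaned = price_text.replace('$', '').strip()
    return float(cleaned) == 0.0


for price in ['0', '0.0', '$4.99']:
    print(price, is_free(price))
```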


As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

In [152]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

# display_table prints its result and returns None, so it is called without print()
display_table(ios_final, 11)      # prime_genre
print('\n')
display_table(android_final, 1)   # Category
print('\n')
display_table(android_final, -4)  # Genres

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.31716123208
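
For comparison, the same kind of percentage table can be built with `collections.Counter` from the standard library. This is a sketch on a made-up four-row sample, and `freq_table_percent` is a hypothetical name chosen so that it does not clash with `freq_table` above.

```python
from collections import Counter


def freq_table_percent(dataset, index):
    """Map each distinct value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up sample for illustration only: app name, genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

table = freq_table_percent(sample, 1)
# Sort by percentage, largest share first.
for genre, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(f'{genre} : {percentage:.2f}')
```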

It seems that the App Store is dominated by apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity) are rarer. Games alone account for about 58% of the free English apps.


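The Google Play store shows a different pattern: practical apps (tools, business, lifestyle, productivity, finance) seem to be much better represented.

We can't recommend an app profile for either store yet, because the number of apps in a genre does not tell us how many users those apps have. As a proxy for the number of users on the App Store, we compute the average number of user ratings (rating_count_tot) per genre.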

In [156]:
#for each genre, average the total number of user ratings (column 5, rating_count_tot)
prime_genre_table=freq_table(ios_final,11)

for genre in prime_genre_table:
    total=0
    len_genre=0
    for app in ios_final:
        genre_app=app[11]
        if genre==genre_app:
            total+=float(app[5])
            len_genre+=1
    average_number_user_ratings=total/len_genre
    print(genre,average_number_user_ratings)
    
    
    

Food & Drink 33333.92307692308
Medical 612.0
Sports 23008.898550724636
Music 57326.530303030304
Shopping 26919.690476190477
Catalogs 4004.0
Book 39758.5
Photo & Video 28441.54375
Finance 31467.944444444445
Weather 52279.892857142855
Lifestyle 16485.764705882353
Travel 28243.8
Games 22788.6696905016
Productivity 21028.410714285714
Navigation 86090.33333333333
News 21248.023255813954
Health & Fitness 23298.015384615384
Education 7003.983050847458
Business 7491.117647058823
Social Networking 71548.34905660378
Utilities 18684.456790123455
Entertainment 14029.830708661417
Reference 74942.11111111111
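
The averages above are printed in dictionary order, which makes the ranking hard to read. The sketch below computes per-group averages in a single pass over the data and prints them from highest to lowest. It runs on a made-up three-row sample, and `average_by_group` is a hypothetical helper that is not part of the original notebook.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Average the numeric column `value_index` for each value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample for illustration only: genre, number of ratings.
sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]

averages = average_by_group(sample, 0, 1)
# Sort by average, highest first.
for genre, average in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, average)
```

On the real data, the call would presumably be `average_by_group(ios_final, 11, 5)`.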
