# Profitable App Profiles for the App Store and Google Play Markets

- We'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

- We only build apps that are free to download and install, and our main source of revenue is in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use it: the more users who see and engage with the ads, the better.

- Our goal for this project is to analyze data to help our mobile app developers understand what kinds of apps are likely to attract more users.






## Dataset links:

- 'googleplaystore.csv'   [Dataset](https://www.kaggle.com/lava18/google-play-store-apps/home) 
- 'AppleStore.csv'  [Dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

In [2]:
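from csv import reader

# Open both files with an explicit encoding; some app names contain non-ASCII characters
ios_open_file=open('AppleStore.csv',encoding='utf8')
and_open_file=open('googleplaystore.csv',encoding='utf8')

ios_read=reader(ios_open_file)
and_read=reader(and_open_file)

ios_apps_data=list(ios_read)
and_apps_data=list(and_read)

# The rows are now in memory, so the file handles can be closed
ios_open_file.close()
and_open_file.close()

# Separate the header row from the data rows
ios_head=ios_apps_data[0]
and_head=and_apps_data[0]

ios_apps_data=ios_apps_data[1:]
and_apps_data=and_apps_data[1:]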

def explore(dataset,start,end,rows_n_columns=False): 
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n")
    if rows_n_columns:
        print("Number of rows ",len(dataset))
        print("Number of Columns",len(dataset[0]))
        print("\n")

explore(ios_apps_data,0,6,True)
explore(and_apps_data,0,6,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']


Number of rows  7197
Number of Columns 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+

# Data Cleaning 

- Our company only builds free apps aimed at an English-speaking audience, so we need to remove inaccurate entries, duplicates, non-English apps, and paid apps before the analysis.

In [3]:
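# Row 10472 (header excluded) is an inaccurate entry, so we remove it.
# Run this cell only once: each additional run deletes another, valid row.
#print(and_apps_data[10472])
del and_apps_data[10472]
print(and_apps_data[10472])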

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


# Duplicate check

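- First we count how many Android entries are duplicates and how many are unique.
- Then we remove the duplicate entries.
- For each duplicated app, we keep only the entry with the highest number of reviews, which is presumably the most recent snapshot of that app.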
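
The "keep the entry with the most reviews" rule can be sketched on a few made-up rows. The rows and the helper name `keep_most_reviewed` are illustrative assumptions, not part of the dataset or of this notebook; only the positions of the app name (index 0) and the review count (index 3) mirror the Android data.

```python
# Hypothetical rows laid out like the Android data: name at index 0, reviews at index 3.
rows = [
    ['App A', 'TOOLS', '4.1', '100'],
    ['App B', 'GAME', '4.5', '250'],
    ['App A', 'TOOLS', '4.2', '180'],  # same app, more reviews
    ['App B', 'GAME', '4.5', '250'],   # exact repeat, ties on reviews
]

def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

deduplicated = keep_most_reviewed(rows)
for row in deduplicated:
    print(row)
```

A single dictionary keyed by app name handles both the "highest review count" lookup and the tie case, so no separate list of already-added names is needed in this variant.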

In [4]:
def divide_apps(dataset):
    # Split app names into first occurrences (unique) and repeats (duplicate)
    duplicate=[]
    unique=[]
    for row in dataset:
        if row[0] in unique:
            duplicate.append(row[0])
        else:
            unique.append(row[0])
    
    print(len(duplicate))
    return duplicate,unique

duplicate,unique=divide_apps(and_apps_data)
    

1181


In [5]:
print(and_head)

print(ios_head)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [6]:
# For each app name, record the highest review count seen across its entries
reviews_max={}
for row in and_apps_data:
    name=row[0]
    n_reviews=float(row[3])
    if name in reviews_max:
        if reviews_max[name] < n_reviews:
            reviews_max[name]=n_reviews
    else:
        reviews_max[name]=n_reviews
print(len(reviews_max))
    

9659


In [7]:
android_clean=[]
# Names already kept; guards against duplicates that tie on the maximum review count
already_added=[]

for row in and_apps_data:
    name=row[0]
    n_reviews=float(row[3])
    if n_reviews==reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
explore(android_clean,0,6,True)       
    
    

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+

In [8]:
print(len(already_added))

9659


In [9]:
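def charCheck(string):
    # Treat a name as English if it has at least one ASCII character
    # and no more than three non-ASCII characters (emoji, ™, and so on)
    flag=False
    count=0
    for i in string:
        if ord(i)<=127:
            flag=True
        else:
            count+=1
            if count==4:
                return False
    return flag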

print(charCheck('Instagram'))
print(charCheck('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(charCheck('Docs To Go™ Free Office Suite'))
print(charCheck('Instachat 😜'))
            

True
False
True
True
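
The same idea, tolerating a few non-ASCII characters before rejecting a name, can be written more compactly. This is only a sketch: the name `is_mostly_ascii` and the `max_non_ascii` parameter are assumptions introduced here, and it differs from `charCheck` in one edge case, since it returns `True` for an empty name or a name made only of up to three non-ASCII characters.

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Return True when the name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

# The four names tested in the cell above
for name in ['Instagram', '爱奇艺PPS -《欢乐颂2》电视剧热播',
             'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, ':', is_mostly_ascii(name))
```

Exposing the threshold as a parameter makes it easy to see how many apps would be dropped under a stricter or looser rule.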


In [10]:
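def filter_non_eng(dataset1,dataset2):
    # dataset1 is the Android data (app name at index 0);
    # dataset2 is the iOS data (app name at index 1)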
    new_list1=[]
    new_list2=[]
    for row in dataset1:
        name=row[0]
        if(charCheck(name)):
            new_list1.append(row)
    for row in dataset2:
        name=row[1]
        if(charCheck(name)):
            new_list2.append(row)
    
    return new_list1,new_list2
            
    

In [11]:
android_clean,ios_clean=filter_non_eng(android_clean,ios_apps_data)
explore(android_clean,0,6,True)       
explore(ios_clean,0,6,True)       


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+

In [14]:
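ios_apps=[] # free iOS apps; the price column is at index 4
and_apps=[] # free Android apps; the Type column ('Free' for free apps) is at index 6

for row in ios_clean:
    price=float(row[4])
    if price == 0:
        ios_apps.append(row)

for row in android_clean:
    app_type=row[6]
    if app_type.upper() =='FREE' :
        and_apps.append(row)
        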
        
print(len(ios_apps))
print(len(and_apps))
    
    


3206
8863


## So far in the data cleaning process, we:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps
- Isolated the free apps

**To minimize risks and overhead, our validation strategy for an app idea consists of three steps:**


- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we then develop it further.
- If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

In [15]:
print(and_head) # Category is at index 1, Genres at index 9
print(ios_head) # prime_genre is at index 11

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [17]:
def freq_table(dataset,index):
    # Count how many rows share each value found in the given column
    freq={}
    for row in dataset:
        if row[index] in freq:
            freq[row[index]]+=1
        else:
            freq[row[index]]=1
            
    return freq
            
    

In [18]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        


In [20]:
display_table(ios_apps,11) # prime_genre

Games : 1865
Entertainment : 253
Photo & Video : 160
Education : 118
Social Networking : 104
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 50
News : 43
Travel : 39
Finance : 35
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 13
Navigation : 6
Medical : 6
Catalogs : 4
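
Raw counts are easier to compare as shares of the whole. The sketch below converts a few of the counts printed above into percentages of the 3206 free English iOS apps counted earlier; the helper name `to_percentages` is an assumption introduced for illustration and is not part of the notebook.

```python
# Counts copied from the frequency table above; 3206 is the number of free English iOS apps.
counts = {'Games': 1865, 'Entertainment': 253, 'Photo & Video': 160}
total_apps = 3206

def to_percentages(freq, total):
    """Convert a dictionary of counts into percentages of the given total."""
    return {key: freq[key] / total * 100 for key in freq}

percentages = to_percentages(counts, total_apps)
for genre, share in percentages.items():
    print(genre, ':', round(share, 2))
```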


In [21]:
display_table(and_apps,9)

Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 313
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 80
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

In [22]:
display_table(and_apps,1)

FAMILY : 1675
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


In [23]:
ios_freq=freq_table(ios_apps,11)

In [24]:
print(ios_freq)

{'News': 43, 'Shopping': 84, 'Finance': 35, 'Sports': 69, 'Health & Fitness': 65, 'Games': 1865, 'Catalogs': 4, 'Productivity': 56, 'Food & Drink': 26, 'Travel': 39, 'Medical': 6, 'Social Networking': 104, 'Weather': 28, 'Business': 17, 'Reference': 18, 'Education': 118, 'Navigation': 6, 'Utilities': 81, 'Photo & Video': 160, 'Music': 66, 'Lifestyle': 50, 'Book': 13, 'Entertainment': 253}


In [29]:
# Average number of user ratings (rating_count_tot, index 5) per iOS genre
avg=[]
for genre in ios_freq:
    total=0
    len_genre=0
   
    for row in ios_apps:
        genre_app=row[11]
        if genre_app==genre:
            n_ratings=float(row[5])
            total+=n_ratings
            len_genre+=1

    avg.append((total/len_genre,genre))
print(avg)
    
            

[(21248.023255813954, 'News'), (26919.690476190477, 'Shopping'), (32367.02857142857, 'Finance'), (23008.898550724636, 'Sports'), (23298.015384615384, 'Health & Fitness'), (22898.638605898122, 'Games'), (4004.0, 'Catalogs'), (21028.410714285714, 'Productivity'), (33333.92307692308, 'Food & Drink'), (28964.05128205128, 'Travel'), (612.0, 'Medical'), (72916.54807692308, 'Social Networking'), (52279.892857142855, 'Weather'), (7491.117647058823, 'Business'), (74942.11111111111, 'Reference'), (7003.983050847458, 'Education'), (86090.33333333333, 'Navigation'), (18684.456790123455, 'Utilities'), (28441.54375, 'Photo & Video'), (57326.530303030304, 'Music'), (16815.48, 'Lifestyle'), (42816.846153846156, 'Book'), (14085.284584980238, 'Entertainment')]


In [30]:
print(sorted(avg))

[(612.0, 'Medical'), (4004.0, 'Catalogs'), (7003.983050847458, 'Education'), (7491.117647058823, 'Business'), (14085.284584980238, 'Entertainment'), (16815.48, 'Lifestyle'), (18684.456790123455, 'Utilities'), (21028.410714285714, 'Productivity'), (21248.023255813954, 'News'), (22898.638605898122, 'Games'), (23008.898550724636, 'Sports'), (23298.015384615384, 'Health & Fitness'), (26919.690476190477, 'Shopping'), (28441.54375, 'Photo & Video'), (28964.05128205128, 'Travel'), (32367.02857142857, 'Finance'), (33333.92307692308, 'Food & Drink'), (42816.846153846156, 'Book'), (52279.892857142855, 'Weather'), (57326.530303030304, 'Music'), (72916.54807692308, 'Social Networking'), (74942.11111111111, 'Reference'), (86090.33333333333, 'Navigation')]
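
The nested loop above rescans every app once per genre. A possible alternative, sketched here on made-up rows, accumulates totals and counts in a single pass. The sample rows and the helper name `average_by_key` are assumptions for illustration only.

```python
# Hypothetical rows: index 0 is the genre, index 1 is the number of ratings.
sample_apps = [
    ['Games', '1000'],
    ['Games', '3000'],
    ['Music', '500'],
]

def average_by_key(dataset, key_index, value_index):
    """Average a numeric column per group in one pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        key = row[key_index]
        totals[key] = totals.get(key, 0.0) + float(row[value_index])
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}

averages = average_by_key(sample_apps, 0, 1)
# Print from the highest to the lowest average
for genre, value in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', value)
```

With the notebook's data, the equivalent call would presumably be `average_by_key(ios_apps, 11, 5)`.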


In [31]:
display_table(and_apps, 5) # the Installs column


1,000,000+ : 1394
100,000+ : 1024
10,000,000+ : 935
10,000+ : 904
1,000+ : 744
100+ : 613
5,000,000+ : 605
500,000+ : 493
50,000+ : 423
5,000+ : 400
10+ : 314
500+ : 288
50,000,000+ : 204
100,000,000+ : 189
50+ : 170
5+ : 70
1+ : 45
500,000,000+ : 24
1,000,000,000+ : 20
0+ : 4
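
The Installs values are open-ended buckets such as `1,000,000+`, not exact counts, so any arithmetic on them has to start from a string conversion. A minimal sketch, assuming each bucket is treated as its lower bound (the helper name `parse_installs` is hypothetical):

```python
def parse_installs(text):
    """Convert a bucket label such as '1,000,000+' to its lower bound as an int."""
    return int(text.replace(',', '').replace('+', ''))

for label in ['0+', '5,000+', '1,000,000,000+']:
    print(label, '->', parse_installs(label))
```

Because `100,000+` could mean anything from 100,000 up to the next bucket, averages built on these values are rough lower-bound estimates.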


In [33]:
# Average number of installs per Android category; '1,000,000+' is read as 1000000
categories_android = freq_table(and_apps, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in and_apps:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

EVENTS : 253542.22222222222
SOCIAL : 23253652.127118643
SPORTS : 3638640.1428571427
VIDEO_PLAYERS : 24727872.452830188
LIFESTYLE : 1437816.2687861272
HOUSE_AND_HOME : 1331540.5616438356
FAMILY : 3697848.1731343283
WEATHER : 5074486.197183099
PHOTOGRAPHY : 17840110.40229885
TRAVEL_AND_LOCAL : 13984077.710144928
FOOD_AND_DRINK : 1924897.7363636363
COMMUNICATION : 38456119.167247385
AUTO_AND_VEHICLES : 647317.8170731707
PERSONALIZATION : 5201482.6122448975
FINANCE : 1387692.475609756
LIBRARIES_AND_DEMO : 638503.734939759
ART_AND_DESIGN : 1986335.0877192982
GAME : 15588015.603248259
BEAUTY : 513151.88679245283
TOOLS : 10801391.298666667
SHOPPING : 7036877.311557789
BOOKS_AND_REFERENCE : 8767811.894736841
PRODUCTIVITY : 16787331.344927534
MEDICAL : 120550.61980830671
NEWS_AND_MAGAZINES : 9549178.467741935
HEALTH_AND_FITNESS : 4188821.9853479853
COMICS : 817657.2727272727
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
MAPS_AND_NAVIGATION : 4056941.

In [34]:
# Installs of COMMUNICATION apps that have fewer than 100 million installs
under_100_m = []

for app in and_apps:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3603485.3884615386
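
The COMMUNICATION average drops from about 38.5 million to about 3.6 million once apps with 100 million or more installs are left out, which suggests that a handful of very large apps dominate the mean. The sketch below illustrates that effect on made-up install counts by comparing the mean with the median; the numbers are hypothetical and not taken from the dataset.

```python
from statistics import mean, median

# Hypothetical install counts: one very large app among smaller ones.
installs = [1_000_000_000, 10_000_000, 1_000_000, 500_000, 100_000]

print('mean  :', mean(installs))
print('median:', median(installs))

# Dropping the single largest value changes the mean far more than the median.
without_largest = sorted(installs)[:-1]
print('mean without largest  :', mean(without_largest))
print('median without largest:', median(without_largest))
```

Reporting a median next to each category average would be one way to check whether a category is broadly popular or just carried by a few giants.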

In [35]:
for app in and_apps:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E