Our aim in this project is to find mobile app profiles that are profitable in the App Store and Google Play markets. We analyze only apps that are free to download and install, for which in-app ads are the main source of revenue. Revenue for any given app is therefore driven mostly by the number of people who use it. The goal of this project is to analyze data that helps developers understand what kinds of apps are likely to attract more users.

Opening and Exploring the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

This project uses the following data sets, both available on Kaggle:

[A data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately ten thousand Android apps from Google Play

[A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately seven thousand iOS apps from the App Store

Let's start by opening the two data sets and then continue with exploring the data.

In [65]:
from csv import reader

# Read each CSV fully inside a `with` block so the file handle is closed afterwards
with open('AppleStore.csv', encoding='utf8') as fhand_1:
    ios_raw = list(reader(fhand_1))
with open('googleplaystore.csv', encoding='utf8') as fhand_2:
    android_raw = list(reader(fhand_2))

ios_header = ios_raw[0]
ios_data = ios_raw[1:]

android_header = android_raw[0]
android_data = android_raw[1:]

In [66]:
print(ios_header)
del(ios_header[0])     # to remove the leading index column
print(ios_header)
print(ios_data[:3])
for row in ios_data:
    del(row[0])

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
[['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'], ['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'], ['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']]


In [67]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # leaves two blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
print('iOS AppStore Data Sample - ')
explore_data(ios_data, 0,3,True)
print('\nAndroid AppStore Data Sample - ')
explore_data(android_data, 0,3,True)


iOS AppStore Data Sample - 
['281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 16

Android AppStore Data Sample - 
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '

Before starting any analysis, we need to ensure the data is clean. The following steps check whether - 
1. The data has any incomplete rows : by comparing the length of each row to the length of the header
2. The data has any duplicates : by building lists of unique and duplicate app entries
3. The data contains apps with non-English names : by checking the ASCII code of each character in the app name

**Test 1 : Incompleteness check**

In [68]:
for index, row in enumerate(android_data):
    if len(row) != len(android_header):
        print(row)
        print(index)    # position of the incomplete row in android_data

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


In [69]:
del(android_data[10472])    # remove the incomplete row found above (12 fields instead of 13); run this cell only once

In [70]:
print(len(android_data))

10840


In [71]:
for index, row in enumerate(ios_data):
    if len(row) != len(ios_header):
        print(row)
        print(index)    # position of the incomplete row in ios_data

**Test 2 : Duplicates check**

In [72]:
unique_apps = []
duplicate_apps = []
for app in android_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps in android data:', len(duplicate_apps))
print('Sample duplicate apps in android data:', duplicate_apps[:4])

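unique_apps = []
duplicate_apps = []
for app in ios_data:
    app_id = app[0]    # column 0 is 'id' here (the index column was removed earlier), so this checks for duplicate ids
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else:
        unique_apps.append(app_id)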

print('Number of duplicate apps in ios data:', len(duplicate_apps))
print('Sample duplicate apps in ios data:', duplicate_apps[:4])

Number of duplicate apps in android data: 1181
Sample duplicate apps in android data: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings']
Number of duplicate apps in ios data: 0
Sample duplicate apps in ios data: []


In [73]:
# Map each app name to the highest review count seen among its duplicate rows
reviews_max = {}
for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    if name not in reviews_max or n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
    

In [74]:
print('Length of the dataset with no duplicates - ', len(reviews_max))

Length of the dataset with no duplicates -  9659


In [75]:
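# Keep, for each app, only the row whose review count equals the maximum;
# already_added guards against several rows sharing that same maximum
android_data_dedup = []
already_added = []
for app in android_data:
    name = app[0]
    review = float(app[3])
    if (review == reviews_max[name]) and (name not in already_added):
        android_data_dedup.append(app)
        already_added.append(name)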

In [76]:
explore_data(android_data_dedup, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
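
The deduplication above keeps, for each app name, the row with the highest number of reviews. As a minimal sketch of the same idea in a single pass, the hypothetical helper below (dedupe_by_max_reviews, not part of the notebook) stores the best row per name directly in a dictionary. The sample rows are made up for illustration, and the output order follows the first appearance of each name, which may differ from the order produced by the cells above.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' means the first row wins when review counts are tied
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Hypothetical rows using the same positions for name (0) and reviews (3)
sample = [
    ['App A', 'TOOLS', '4.1', '100'],
    ['App B', 'GAME', '4.5', '50'],
    ['App A', 'TOOLS', '4.2', '250'],
]
deduped = dedupe_by_max_reviews(sample)
print(len(deduped))
print(deduped[0])
```

Using a dictionary lookup avoids the repeated list membership checks, which can matter on larger data sets.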


**Test 3 : Only English names check**

In [77]:
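def is_english(name):
    # Count characters outside the ASCII range (code points above 127)
    count = 0
    for char in name:
        if ord(char) > 127:
            count += 1
    # Allow up to three such characters (e.g. emoji or ™) before rejecting the name
    return count <= 3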

android_data_eng=[]    
for app in android_data_dedup:
    if is_english(app[0]):
        android_data_eng.append(app)

print('New list length for Android data - ', len(android_data_eng))

New list length for Android data -  9614


In [78]:
ios_data_eng = []    
for app in ios_data:
    if is_english(app[1]):
        ios_data_eng.append(app)

print('New list length for iOS data - ', len(ios_data_eng))    


New list length for iOS data -  6183


At the start of the analysis, we decided to do this exercise only for free apps, so we now filter out of both data sets the apps that are not free to install and use.

In [79]:
android_free = []
ios_free = []

for row in android_data_eng:
    price = row[7]
    if price == '0':
        android_free.append(row)


for row in ios_data_eng:
    price = row[4]
    if price == '0':
        ios_free.append(row)
    
print('Android list length - ', len(android_free))
print('iOS list length - ', len(ios_free))


            

Android list length -  8864
iOS list length -  3222
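
The filter above compares the price field to the exact string '0', which matches how prices are stored in the files used here. If a copy of the data stored free prices differently (for example as '0.0' or '$0'), a numeric comparison would be more forgiving. The sketch below shows one way to do that; is_free is a hypothetical helper, not part of the notebook.

```python
def is_free(price_text):
    """Treat '0', '0.0', '$0.00' and similar strings as a price of zero."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # unparseable price: do not count the app as free


for price in ['0', '0.0', '$0', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```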


Now that we have clean data, we can look at how the apps are categorized in each data set and which categories or genres are most common. For this, we build a frequency table (a dictionary) that gives the percentage of apps in each category.

The two data sets have different columns, so we use a different column for each: the prime_genre column (index -5) for iOS and the category column (index 1) for Android.

In [80]:
def freq_dict(a_list, index):
    a_dict = {}
    total = 0
    for row in a_list:
        total +=1                               #same as list length
        key_col = row[index]
        if key_col in a_dict:
            a_dict[key_col] += 1
        else:
            a_dict[key_col] = 1
    
    percent_dict = {}
    for key in a_dict:
        percent_dict[key] = (a_dict[key] / total) * 100
    return percent_dict

def display_table(dataset, index):
    table = freq_dict(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

print('iOS List - Prime Genre:')
display_table(ios_free, -5)

print('\nAndroid List - Category:')
display_table(android_free, 1)

iOS List - Prime Genre:
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665

Android List - Category:
FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758
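
The same kind of percentage table can also be built with collections.Counter from the standard library. The sketch below is an alternative to freq_dict and display_table, not the notebook's method; percent_table and the sample rows are hypothetical names and values used only for illustration.

```python
from collections import Counter


def percent_table(rows, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(key, count / total * 100) for key, count in counts.most_common()]


# Hypothetical rows: app name in column 0, category in column 1
sample = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'FAMILY']]
for category, pct in percent_table(sample, 1):
    print(category, ':', pct)
```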

Although a category's share of apps is a good indicator of how many apps are available in it, it says nothing about how much those apps are actually used. So we will look at how these categories/genres compare in terms of average number of user ratings (iOS) and average number of installs (Android).

In [81]:
ios_genre_count = freq_dict(ios_free, -5)
print('For iOS data - \nGenre : Avg number of ratings\n')
for genre in ios_genre_count:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[-5]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    avg_rating = total/len_genre
    
    print(genre,' : ', avg_rating)       
    

For iOS data - 
Genre : Avg number of ratings

Productivity  :  21028.410714285714
Weather  :  52279.892857142855
Shopping  :  26919.690476190477
Reference  :  74942.11111111111
Finance  :  31467.944444444445
Music  :  57326.530303030304
Utilities  :  18684.456790123455
Travel  :  28243.8
Social Networking  :  71548.34905660378
Sports  :  23008.898550724636
Health & Fitness  :  23298.015384615384
Games  :  22788.6696905016
Food & Drink  :  33333.92307692308
News  :  21248.023255813954
Book  :  39758.5
Photo & Video  :  28441.54375
Entertainment  :  14029.830708661417
Business  :  7491.117647058823
Lifestyle  :  16485.764705882353
Education  :  7003.983050847458
Navigation  :  86090.33333333333
Medical  :  612.0
Catalogs  :  4004.0


In [82]:
android_installs_count = freq_dict(android_free, 1)
print('For Android data - \nCategory : Avg number of installations\n')

for category in android_installs_count:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            installs = float(installs)
            total += installs
            len_category += 1
    avg_installs = total / len_category
    print(category,' : ', avg_installs)

For Android data - 
Category : Avg number of installations

ART_AND_DESIGN  :  1986335.0877192982
AUTO_AND_VEHICLES  :  647317.8170731707
BEAUTY  :  513151.88679245283
BOOKS_AND_REFERENCE  :  8767811.894736841
BUSINESS  :  1712290.1474201474
COMICS  :  817657.2727272727
COMMUNICATION  :  38456119.167247385
DATING  :  854028.8303030303
EDUCATION  :  1833495.145631068
ENTERTAINMENT  :  11640705.88235294
EVENTS  :  253542.22222222222
FINANCE  :  1387692.475609756
FOOD_AND_DRINK  :  1924897.7363636363
HEALTH_AND_FITNESS  :  4188821.9853479853
HOUSE_AND_HOME  :  1331540.5616438356
LIBRARIES_AND_DEMO  :  638503.734939759
LIFESTYLE  :  1437816.2687861272
GAME  :  15588015.603248259
FAMILY  :  3695641.8198090694
MEDICAL  :  120550.61980830671
SOCIAL  :  23253652.127118643
SHOPPING  :  7036877.311557789
PHOTOGRAPHY  :  17840110.40229885
SPORTS  :  3638640.1428571427
TRAVEL_AND_LOCAL  :  13984077.710144928
TOOLS  :  10801391.298666667
PERSONALIZATION  :  5201482.6122448975
PRODUCTIVITY  :  16787
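
The two cells above rescan the whole data set once per category. As a sketch under the same column assumptions, the per-category averages can also be accumulated in a single pass. The helpers parse_installs and average_by_key and the sample rows are hypothetical, introduced only for this illustration. Note that Android install counts are bucket lower bounds such as '10,000+', so the resulting averages are approximate.

```python
from collections import defaultdict


def parse_installs(text):
    """Turn a string such as '10,000+' into the float 10000.0."""
    return float(text.replace('+', '').replace(',', ''))


def average_by_key(rows, key_idx, value_idx, parse=float):
    """Average the parsed value column for each distinct key, in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_idx]
        totals[key] += parse(row[value_idx])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical rows: name, category, installs
sample = [
    ['App A', 'TOOLS', '10,000+'],
    ['App B', 'TOOLS', '30,000+'],
    ['App C', 'GAME', '1,000,000+'],
]
averages = average_by_key(sample, 1, 2, parse=parse_installs)
for category, avg in averages.items():
    print(category, ' : ', avg)
```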

**Conclusion**
We looked at the average number of ratings and installs per category and, on the surface, it looks like Navigation (on iOS) and Communication (on Android) dominate the market. However, if we dig deeper, we realize these high numbers are due to a few market-leading apps that skew the averages, for example Google Maps, Waze, WhatsApp etc. 
It is highly likely that the best genre to create an app in is one that is currently underserved and still has a respectable number of users, such as Books and Reference.
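
To illustrate how a single market leader can skew a category average, the sketch below compares the mean and the median of a made-up list of install counts. The numbers are hypothetical and are not taken from either data set.

```python
from statistics import mean, median

# Hypothetical install counts for one category: four small apps and one giant
installs = [5_000, 10_000, 50_000, 100_000, 1_000_000_000]

print('Mean installs:  ', mean(installs))
print('Median installs:', median(installs))

# Leaving the giant out shows what a typical app in the category looks like
print('Mean without the largest app:', mean(installs[:-1]))
```

The mean is pulled up by the one very large app, while the median stays close to the typical app, so reporting the median alongside the average can make such skew easier to spot.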