# **App Profile Recommendation**

*We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.*

In [6]:
from csv import reader
with open('AppRecommendation/AppleStore.csv') as iosFile:
    iosReader = reader(iosFile)
    ios = list(iosReader)
    iosHeader = ios[0]
    ios = ios[1:]
    
with open('AppRecommendation/googleplaystore.csv') as androidFile:
    androidReader = reader(androidFile)
    android = list(androidReader)
    androidHeader = android[0]
    android = android[1:]

In [7]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Ios data set documentation:
https://dq-content.s3.amazonaws.com/350/googleplaystore.csv

In [8]:
print(iosHeader)
print('\n')
explore_data(ios, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


Android data set documentation:
https://dq-content.s3.amazonaws.com/350/AppleStore.csv

In [9]:
print(androidHeader)
print('\n')
explore_data(android, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

In [10]:
print(android[10472]) #wrong entry
del android[10472]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


# Duplicates

Google play store dataset has multiple entries of the same app. For instance,

In [11]:
for entry in android:
    app = entry[0]
    if app == "Instagram":
        print(entry, '\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 



In [12]:
unique = []
duplicates = []
for entry in android:
    app = entry[0]
    if app not in unique:
        unique.append(app)
    else:
        duplicates.append(app)
print("Number of duplicates:",len(duplicates))

Number of duplicates: 1181


Duplicates will not be removed randomly. Instead, the row with highest number of reviews would be kept as that would imply that it is the latest entry.

**Putting highest no. of reviews of each app in a dictionary**
* Checking if name is in the dictionary as a key and if it has more no. of reviews than the dictionary value to *update the no. of reviews in the dictionary.*
* Checking if name is not in the dictionary as a key to *add the name as the key and no. of reviews as the value in the dictionay.*

In [13]:
reviews_max = {}
for entry in android:
    name = entry[0]
    n_reviews = float(entry[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        n_reviews = reviews_max
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print("Length of dictionary:",len(reviews_max))

Length of dictionary: 9659



**Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below:**

* We start by initializing two empty lists, android_clean and already_added.
* We loop through the android data set, and for every iteration:
    * We isolate the name of the app and the number of reviews.
    * We add the current row (app) to the android_clean list, and the app name (name) to the already_added list if:
        * The number of reviews of the current app matches the number of reviews of that app as described in the reviews_max             dictionary; and
        * The name of the app is not already in the already_added list. We need to add this supplementary condition to account           for those cases where the highest number of reviews of a duplicate app is the same for more than one entry.


In [14]:
android_clean = []
already_added = []
for entry in android:
    name = entry[0]
    n_reviews = float(entry[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(entry)
        already_added.append(name)
print("Length of clean dataset:",len(android_clean))

Length of clean dataset: 9659


# Detecting non-English apps

In [19]:
def nonEnglishApps(inputString):
    countSpecial = 0
    for character in inputString:
        if ord(character) > 127:
            countSpecial += 1
    if countSpecial > 3:
        return False
    else:
        return True

nonEnglishApps('Instachat 😜')

True

In [20]:
ios_English = []
android_English = []

for entry in ios:
    name = entry[1]
    if nonEnglishApps(name):
        ios_English.append(entry)
        
for entry in android_clean:
    name = entry[0]
    if nonEnglishApps(name):
        android_English.append(entry)
        
explore_data(ios_English, 0, 3, True)
print("\n")
explore_data(android_English, 0, 3, True)

        


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1,

# Isolate Free Apps

In [29]:
ios_free = []
android_free = []
for entry in ios_English:
    price = entry[4]
    if price == "0.0":
        ios_free.append(entry)
        
for entry in android_English:
    price = entry[7]
    if price == "0" :
        android_free.append(entry)
print("Ios Length:")        
explore_data(ios_free,0,0,True)
print("\nAndroid Length:")
explore_data(android_free,0,0,True)

Ios Length:
Number of rows: 3222
Number of columns: 16

Android Length:
Number of rows: 8862
Number of columns: 13


**To minimize risks and overhead, our validation strategy for an app idea has three steps:**
* Build a minimal Android version of the app, and add it to Google Play.
* If the app has a good response from users, we develop it further.
* If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. 

In [30]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for entry in dataset:
        total+=1
        value = entry[index]
        if value not in table:
            table[value] = 1
        else:
            table[value] += 1
            
    table_perc = {}
    for key in table:
        percentage = (table[key]/total)*100
        table_perc[key] = percentage
    return table_perc
    


In [31]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [32]:
display_table(ios_free, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


* Most common genre: Games, Next most common genre: Entertainment
* More apps for entertainment purposes

**Category column**

In [40]:
display_table(android_free, 1)

FAMILY : 18.449559918754233
GAME : 9.873617693522906
TOOLS : 8.440532611148726
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.782893252087565
MAPS_AND_NAVIGATION : 1.399232678853532
EDUCATION : 1.2863913337846988
FOOD_AND_DRINK : 1.2412547957571656
ENTERTAINMENT : 1.128413450688332
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8350259535093659
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
ART_AND_DESIGN : 0.6770480704129994
PARENTING : 0.65

**Genres column:**

In [41]:
display_table(android_free, -4)

Tools : 8.429248476641842
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5206499661475967
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7603249830737984
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

In [42]:
ios_final = ios_free
android_final = android_free

# Most popular apps in Appstore

In [49]:
iosGenre = freq_table(ios_final, 11)
for genre in iosGenre:
    total = 0
    len_genre = 0
    for entry in ios_final:
        genre_app = entry[11]
        if genre_app == genre:
            userRating = float(entry[5])
            total += userRating
            len_genre += 1
    avg = total/len_genre
    print(genre, avg)
            
            

Social Networking 71548.34905660378
Photo & Video 28441.54375
Games 22788.6696905016
Music 57326.530303030304
Reference 74942.11111111111
Health & Fitness 23298.015384615384
Weather 52279.892857142855
Utilities 18684.456790123455
Travel 28243.8
Shopping 26919.690476190477
News 21248.023255813954
Navigation 86090.33333333333
Lifestyle 16485.764705882353
Entertainment 14029.830708661417
Food & Drink 33333.92307692308
Sports 23008.898550724636
Book 39758.5
Finance 31467.944444444445
Education 7003.983050847458
Productivity 21028.410714285714
Business 7491.117647058823
Catalogs 4004.0
Medical 612.0


Navigation apps have highest no. of reviews.

In [47]:
print(iosHeader)
for entry in ios_final:
    genre = entry[11]
    if genre == "Navigation":
        print(entry[1], ":", entry[5])   

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


No. of ratings is highly influenced by few apps which have a large concentration of ratings while others struggle. This makes navigation seem popular than it is.

In [65]:
androidCategory = freq_table(android_final, 1)
for category in androidCategory:
    total = 0
    len_category = 0
    for entry in android_final:
        category_app = entry[1]
        if category_app == category:
            installs = entry[5]
            installs = installs.replace("+","")
            installs = float(installs.replace(",",""))
            total+=installs
            len_category += 1
    avg = total/len_category
    print(category,":",avg)
            
            

ART_AND_DESIGN : 1905351.6666666667
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 3082017.543859649
ENTERTAINMENT : 21134600.0
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1313681.9054054054
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15837565.085714286
FAMILY : 2691618.159021407
MEDICAL : 120616.48717948717
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17805627.643678162
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10695245.286096256
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24852732.40506329
NEWS_AND_MAGAZINE