# Identifying App Profiles that Drive User Growth

- The goal of this project is to help software developers make data-driven decisions about which app profile to build.
- We are working as data analysts for a mobile development company whose business model relies substantially on ad revenue. We will analyze both App Store and Play Store apps to determine which app traits attract more users.

In [18]:
from csv import reader

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') 
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n')
        
#encoding='utf8' is an assumption about both files; the with blocks close them after reading
with open('AppleStore.csv', encoding='utf8') as opened_file:
    IOS_Apps = list(reader(opened_file))

with open('googleplaystore.csv', encoding='utf8') as opened_file:
    Android_Apps = list(reader(opened_file))

explore_data(IOS_Apps,0,3,rows_and_columns= True)
explore_data(Android_Apps,0,3,rows_and_columns= True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyo

---
In the code above, we took a snapshot of the two data sets we are working with. The iOS set has 16 columns and the Android set has 13, and their column names differ. We will now clean both data sets so that they can be analyzed side by side.
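Because the two data sets use different column names and positions, it can help to look columns up by name instead of by a hard-coded number. The sketch below is one way to do that; `column_index` is a hypothetical helper that is not part of the notebook, and the headers are copied from the output above.

```python
def column_index(header):
    """Map each column name to its position in the row (hypothetical helper)."""
    return {name: position for position, name in enumerate(header)}

# Headers copied from the explore_data output above
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

ios_col = column_index(ios_header)
android_col = column_index(android_header)

print(ios_col['price'], ios_col['rating_count_tot'], ios_col['prime_genre'])   # 4 5 11
print(android_col['Reviews'], android_col['Price'], android_col['Genres'])     # 3 7 9
```

These are the same indices that the later cells use directly.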

---

In [3]:
count=1
unique_apps= []
unique_index=[]
dup_apps=[]
dup_index=[]
for app in Android_Apps[1:]:
    name = app[0]
    if name in unique_apps:
        dup_apps.append(name)
        dup_index.append((name,count))
    else:
        unique_apps.append(name)
        unique_index.append((name,count))
    count += 1
    
print(dup_apps[0]) #first item in dup list
print(dup_index[0]) #first item with index

print(unique_apps[0]) #first item in unique list
print(unique_index[0]) #first item with index

print("Total Duplicate Entries:",len(dup_apps), '\n')

#Checking first duplicate app's index on unique_apps
first_copy = unique_apps.index('Quick PDF Scanner + OCR FREE') #position of the first duplicated app in unique_apps
Index_AndroidApps = unique_index[first_copy][1]


print(Android_Apps[Index_AndroidApps])
print(Android_Apps[230])
    

Quick PDF Scanner + OCR FREE
('Quick PDF Scanner + OCR FREE', 230)
Photo Editor & Candy Camera & Grid & ScrapBook
('Photo Editor & Candy Camera & Grid & ScrapBook', 1)
Total Duplicate Entries: 1181 

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


---

Above, I split the Play Store apps into unique entries and duplicate entries. The result shows that there are indeed many duplicates in the data set (1,181 rows). We will therefore remove the duplicates before continuing with the analysis, keeping, for each app, the entry with the most user reviews.
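As a side note, checking `name in unique_apps` against a list gets slower as the list grows. A minimal sketch of the same duplicate count with `collections.Counter` is shown here; `duplicate_names` and the sample rows are made up for illustration and are not part of the notebook.

```python
from collections import Counter

def duplicate_names(rows, name_index=0):
    """Return {name: occurrences} for every name that appears more than once."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}

# Made-up rows: [name, reviews]
sample = [["A", "10"], ["B", "5"], ["A", "12"], ["A", "12"]]

dups = duplicate_names(sample)
surplus_rows = sum(n - 1 for n in dups.values())  # rows beyond the first copy

print(dups)          # {'A': 3}
print(surplus_rows)  # 2
```

The surplus-row count corresponds to `len(dup_apps)` in the cell above, since that list also records every occurrence after the first.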

---

In [14]:

reviews_max = {}
for app in Android_Apps[1:]:
    name = app[0]
    n_reviews = float(app[3])
    #keep the highest review count seen so far for each app name
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    if name not in reviews_max:
        reviews_max[name] = n_reviews
#reviews_max now maps each unique app name to its highest review count.
#It is time to create the cleaned data

android_clean = []
already_added = []
#loop through Play Store data set to keep one row per app: the one with the max review count
for app in Android_Apps[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
    
        
print(len(android_clean))
    

9658


---
We have now obtained the unique entries of the Play Store data set. We first built a dictionary that maps each app name to its highest review count, then looped through the Android apps and kept one row per name with that count.
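The two loops can also be folded into one pass by storing the best row itself rather than only its review count. This is a sketch under the assumption that every row has a numeric review count at `reviews_index`; `keep_most_reviewed` is a hypothetical name, not a function from the notebook.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows in the Play Store column layout (name at 0, reviews at 3)
sample = [
    ["A", "CAT", "4.0", "10"],
    ["B", "CAT", "3.9", "5"],
    ["A", "CAT", "4.0", "12"],
    ["A", "CAT", "4.0", "12"],
]

cleaned = keep_most_reviewed(sample)
print(cleaned)  # [['A', 'CAT', '4.0', '12'], ['B', 'CAT', '3.9', '5']]
```

Using a dictionary also avoids the repeated `name not in already_added` list scan.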

---

In [21]:
def is_english(string): #returns False once the string contains 3 characters outside the ASCII range (code point > 127)
    count=0
    for character in string:
        if ord(character) > 127:
            count += 1
            if count == 3:
                return False
    return True
# print(is_english('Docs To Go™ Free Office Suite')) #tests
# print(is_english('Instachat 😜'))
# print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
english_IOS= []
english_AND= []
for app in IOS_Apps[1:]: #looping through IOS data set
    name = app[1]
    if is_english(name):
        english_IOS.append(app)
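#Note: this loop reads the raw Android_Apps list, not android_clean, so duplicate entries are still included in the counts below
for app in Android_Apps[1:]: #looping through Android data set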
    name = app[0]
    if is_english(name):
        english_AND.append(app)
print('Total English IOS Apps: ', len(english_IOS))
print('Total English Android Apps: ',len(english_AND))

Total English IOS Apps:  6155
Total English Android Apps:  10779


---
Here, we wrote a function that treats a name as English if it contains fewer than three characters outside the ASCII range. With it, we filtered the non-English apps out of our analysis.
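To make the threshold explicit, the same check can be sketched with the allowed number of non-ASCII characters as a parameter. The parameter name `max_non_ascii` is an assumption introduced here; the default of 2 mirrors the behaviour of the cell above, and the three test names are the ones from its commented-out tests.

```python
def is_english(string, max_non_ascii=2):
    """Return True if the string has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Docs To Go™ Free Office Suite'))   # True  (one non-ASCII character)
print(is_english('Instachat 😜'))                     # True  (one non-ASCII character)
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))      # False (many non-ASCII characters)
```

Allowing a few non-ASCII characters keeps English names that contain a trademark symbol or an emoji.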

---

In [22]:
#price index in IOS 4, Android 7
free_IOS = []
free_Android = []
for app in english_IOS:
    price = app[4]
    if price == '0.0':
        free_IOS.append(app)

for app in english_AND:
    price = app[7]
    if price == '0':
        free_Android.append(app)
print('Total English and Free IOS Apps: ',len(free_IOS))
print('Total English and Free Android Apps: ', len(free_Android))

Total English and Free IOS Apps:  3203
Total English and Free Android Apps:  9983


Lastly, we filtered out the non-free apps by keeping only the rows whose price is zero.
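Comparing the price as text works only if free apps are always written exactly as `'0.0'` or `'0'`. A slightly more tolerant sketch converts the price to a number first. It assumes that paid Play Store prices carry a leading `$` (for example `'$4.99'`), which is not shown in the output above; `is_free` is a hypothetical helper.

```python
def is_free(price):
    """Return True if the price string represents zero; unparseable values count as not free."""
    cleaned = price.strip().lstrip('$')  # assumption: paid prices may look like '$4.99'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

print(is_free('0.0'))     # True
print(is_free('0'))       # True
print(is_free('$4.99'))   # False
print(is_free('Free'))    # False (not a number, so treated as not free)
```

With this helper the same test can be applied to both data sets, for example `[app for app in english_IOS if is_free(app[4])]`.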

---

### So far, we have:
- Removed inaccurate data
- Removed duplicated app entries
- Removed non-English apps
- Removed non-free apps

We will now proceed with the analysis.

**Again, the goal of this analysis is to find app traits that lead to success in both markets in terms of the number of users.** We start by looking at the columns to see which could be relevant for our purposes. In the iOS data set we have the prime_genre column, and in the Android data set we have the Category and Genres columns.

In the code below, we build frequency tables for these columns. We define functions to make this repetitive work faster.

In [31]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
#On IOS dataset, prime_genre is at index 11
#On Android dataset, Category is at index 1 and Genres is at index 9
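For comparison, the same percentage table can be built more compactly with `collections.Counter`. This is only a sketch of an alternative; `freq_table_pct` and the sample rows are made up for illustration.

```python
from collections import Counter

def freq_table_pct(rows, index):
    """Return {value: percentage of rows}, ordered from most to least frequent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

# Made-up rows: [name, genre]
sample = [["a", "Games"], ["b", "Games"], ["c", "Music"], ["d", "Games"]]

table = freq_table_pct(sample, 1)
for genre, percentage in table.items():
    print(genre, ':', percentage)
# Games : 75.0
# Music : 25.0
```

`most_common()` returns the values already sorted by count, so no separate sorting step is needed.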



In [32]:
display_table(free_IOS, 11) #Free IOS Apps, freq_table of prime_genre


Games : 58.25788323446769
Entertainment : 7.836403371838902
Photo & Video : 4.995316890415236
Education : 3.6840462066812365
Social Networking : 3.3093974399000934
Shopping : 2.5913206369029034
Utilities : 2.466437714642523
Sports : 2.1542304089915705
Music : 2.0605682172962845
Health & Fitness : 2.0293474867311896
Productivity : 1.7483609116453322
Lifestyle : 1.5610365282547611
News : 1.3424914142990947
Travel : 1.248829222603809
Finance : 1.0927255697783327
Weather : 0.8741804558226661
Food & Drink : 0.8117389946924758
Reference : 0.5307524196066188
Business : 0.5307524196066188
Book : 0.3746487667811427
Navigation : 0.18732438339057134
Medical : 0.18732438339057134
Catalogs : 0.1248829222603809


Among free English iOS apps, Games dominate, accounting for about 58% of the apps. Entertainment (about 8%) and Photo & Video (about 5%) are the runners-up but are dwarfed by Games. Apps with practical purposes (e.g. Education, Shopping, Utilities, Productivity, and Lifestyle) together also make up a considerable share.

It seems fair to say that, for free English apps, Games is the most crowded genre and most likely also has the most users. We will need to investigate this further.

---

In [33]:
display_table(free_Android, 1)

FAMILY : 17.700090153260543
GAME : 10.567965541420413
TOOLS : 7.632976059300811
BUSINESS : 4.45757788240008
PRODUCTIVITY : 3.956726434939397
SPORTS : 3.5961133927677054
COMMUNICATION : 3.5860963638184917
LIFESTYLE : 3.5760793348692776
MEDICAL : 3.5460282480216367
FINANCE : 3.4959431032755686
HEALTH_AND_FITNESS : 3.2555344084944404
PHOTOGRAPHY : 3.1253130321546627
PERSONALIZATION : 3.085244916357808
SOCIAL : 2.9249724531703896
NEWS_AND_MAGAZINES : 2.7747170189321846
SHOPPING : 2.5743764399479114
TRAVEL_AND_LOCAL : 2.464189121506561
DATING : 2.2738655714715015
BOOKS_AND_REFERENCE : 1.9833717319443052
VIDEO_PLAYERS : 1.7028949213663227
EDUCATION : 1.5125713713312632
ENTERTAINMENT : 1.4725032555344086
MAPS_AND_NAVIGATION : 1.2921967344485625
FOOD_AND_DRINK : 1.252128618651708
HOUSE_AND_HOME : 0.8614644896323751
LIBRARIES_AND_DEMO : 0.8414304317339477
AUTO_AND_VEHICLES : 0.8213963738355204
WEATHER : 0.7312431132925974
EVENTS : 0.6310728238004608
ART_AND_DESIGN : 0.6110387659020335
PARENTING

Among free English Android apps, the Category column is led by FAMILY, GAME, and TOOLS. As on iOS, practical apps take a considerable share of the total. The pattern of entertainment-oriented categories (here FAMILY and GAME) leading the table also matches what we saw on iOS, although their lead is much smaller.

---

In [35]:
display_table(free_Android,9)

Tools : 7.6229590303515975
Entertainment : 6.010217369528198
Education : 5.13873585094661
Business : 4.45757788240008
Productivity : 3.956726434939397
Sports : 3.7363517980566967
Communication : 3.5860963638184917
Lifestyle : 3.566062305920064
Medical : 3.5460282480216367
Finance : 3.4959431032755686
Action : 3.4057898427326454
Health & Fitness : 3.2555344084944404
Photography : 3.1253130321546627
Personalization : 3.085244916357808
Social : 2.9249724531703896
News & Magazines : 2.7747170189321846
Shopping : 2.5743764399479114
Travel & Local : 2.4541720925573474
Dating : 2.2738655714715015
Arcade : 1.9933887608935192
Books & Reference : 1.9833717319443052
Simulation : 1.8832014424521686
Casual : 1.8431333266553143
Video Players & Editors : 1.6828608634678954
Maps & Navigation : 1.2921967344485625
Food & Drink : 1.252128618651708
Puzzle : 1.2120605028548532
Racing : 0.951617750175298
Strategy : 0.9315836922768707
Role Playing : 0.8714815185815887
House & Home : 0.8614644896323751
Librar

The Genres column for free English Android apps shows a similar picture. Tools and Entertainment take the largest shares, and practical apps again account for a considerable portion. So far, the data suggest that games and entertainment (fun apps) make up the largest group of apps and that practical apps take the runner-up spot.

---

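We will now compute, for each genre in our data sets, the average number of user ratings per app (iOS) and the average number of installs per app (Android). Both serve as a proxy for the number of users.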

In [44]:
genre_ios= freq_table(free_IOS, 11)

#For each genre, average rating_count_tot (index 5), i.e. the number of user ratings per app
for genre in genre_ios:
    total= 0
    len_genre= 0
    for app in free_IOS:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings= float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings= total/len_genre
    print(genre,':' ,avg_n_ratings)
    

Sports : 23008.898550724636
Games : 22886.36709539121
Education : 7003.983050847458
Weather : 52279.892857142855
Business : 7491.117647058823
Catalogs : 4004.0
Navigation : 86090.33333333333
Travel : 28243.8
Social Networking : 71548.34905660378
Book : 46384.916666666664
Health & Fitness : 23298.015384615384
Productivity : 21028.410714285714
News : 21248.023255813954
Reference : 79350.4705882353
Music : 57326.530303030304
Shopping : 27230.734939759037
Entertainment : 14195.358565737051
Utilities : 19156.493670886077
Finance : 32367.02857142857
Photo & Video : 28441.54375
Medical : 612.0
Food & Drink : 33333.92307692308
Lifestyle : 16815.48


The data show that Navigation (about 86,000), Reference (about 79,000), and Social Networking (about 71,500) have the highest average number of user ratings per app among free English iOS apps. My first recommendation would be to focus on building iOS apps that fall at the intersection of Social Networking and Reference. These genres already have a significant number of users actively giving reviews, which increases the chance of gaining some exposure in the space.
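The unsorted output above makes the top genres hard to spot. A minimal sketch of a single-pass, ranked version of the same average follows; `average_by_group` and the sample rows are hypothetical and only illustrate the calculation.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return [(group, average value)] sorted from highest to lowest average."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Made-up rows: [genre, number of user ratings]
sample = [["Games", "100"], ["Games", "300"], ["Music", "50"]]

ranked = average_by_group(sample, 0, 1)
print(ranked)  # [('Games', 200.0), ('Music', 50.0)]
```

For the iOS data the call would be `average_by_group(free_IOS, 11, 5)`. An average of this kind can be pulled up by a few very large apps in a genre, so it is worth reading alongside the number of apps per genre.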

---

In [62]:
genres_android = freq_table(free_Android,9)
tot_avg_installs =[]
#For each genre, average the Installs column (index 5) after stripping '+' and ','
for category in genres_android:
    total= 0
    len_category= 0
    for app in free_Android:
        category_app = app[9]
        if category_app == category:
            num_install=app[5]
            num_1 = num_install.replace('+','')
            num_2 = num_1.replace(',','')
            total += float(num_2)
            len_category += 1
    avg_installs= total/len_category
    tot_avg_installs.append((avg_installs, category))
    print(category,':' ,avg_installs)
print('\n')
tot_avg_installs = sorted(tot_avg_installs, reverse=True)

print('Top Two Genres with the most Installs:')
print(tot_avg_installs[0],'\n',tot_avg_installs[1])

Entertainment;Brain Games : 4150000.0
Education;Pretend Play : 2000000.0
Casual : 52515209.61956522
Simulation;Pretend Play : 700000.0
Business : 2250454.1348314607
Strategy;Action & Adventure : 1000000.0
Casual;Creativity : 6000000.0
Racing;Pretend Play : 1000000.0
Art & Design;Action & Adventure : 100000.0
Puzzle : 13770960.256198347
Comics : 982739.4736842106
Role Playing;Pretend Play : 4240000.0
Art & Design : 2268724.074074074
Board;Brain Games : 981250.0
Comics;Creativity : 50000.0
Video Players & Editors;Creativity : 5000000.0
Entertainment;Education : 1000000.0
Maps & Navigation : 5574114.573643411
Travel & Local;Action & Adventure : 100000.0
Adventure;Action & Adventure : 82363636.36363636
Racing : 21054071.789473683
Finance : 2511355.6790830945
Puzzle;Brain Games : 9071176.470588235
Strategy : 21152603.279569894
Lifestyle : 1464321.4297752809
Personalization : 7533233.402597402
Board : 4766088.857142857
Educational;Action & Adventure : 25262500.0
Music;Music & Video : 5050000

On the Play Store, we recommend building an app that falls at the intersection of Communication and Action & Adventure, for the same reasons.
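The install figures behind this recommendation come from strings such as `'10,000+'`, which are open-ended buckets rather than exact counts, so the averages are approximate lower bounds. The conversion can be sketched as a small helper; `parse_installs` is a hypothetical name.

```python
def parse_installs(text):
    """Convert an installs bucket such as '10,000+' to its lower bound as an integer."""
    return int(text.replace('+', '').replace(',', ''))

print(parse_installs('10,000+'))      # 10000
print(parse_installs('5,000,000+'))   # 5000000
print(parse_installs('500,000+'))     # 500000
```

An app listed as `'10,000+'` may have anywhere from 10,000 installs up to the next bucket, which is why the per-genre averages should be read as rough indicators.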

---

# In Conclusion
Take advantage of the big user bases that **Social Networking/Communication and Adventure games** have. A basic way to do this would be to create an Adventure game that heavily **incentivises player-to-player interaction**. The downside of this approach is that these top genres may be so overcrowded that it is almost impossible to stand out.

A good compromise would be to **incorporate middle-of-the-pack genres** in terms of user reviews. This reduces the risk of overcrowding while still offering a significantly larger user base than the bottom genres.
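One simple way to make "middle of the pack" concrete is to rank the genres by their average and keep the middle third. The sketch below uses made-up genre names and values; `middle_third` is a hypothetical helper, and the choice of thirds is an assumption rather than something derived from the data.

```python
def middle_third(averages):
    """Rank genres by average (highest first) and return the middle third of the ranking."""
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    third = len(ranked) // 3
    return ranked[third:len(ranked) - third]

# Made-up genre averages for illustration only
sample = {"A": 900, "B": 700, "C": 500, "D": 300, "E": 200, "F": 100}

print(middle_third(sample))  # [('C', 500), ('D', 300)]
```

The cut-offs are a judgment call; the point is only to exclude both the most crowded and the least used genres.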

