# Profitable App Profiles for the App Store and Google Play Markets

This project is the first major project of the "Data Analysis in Python" course. The project incorporates the major concepts of:
- The basics of programming in Python (arithmetical operations, variables, common data types, etc.)
- Lists and for loops
- Conditional statements
- Dictionaries and frequency tables
- Functions

The goal of this analysis is to help developers understand the type of apps that attract more users on Google Play and the App Store. 

# Opening and Exploring the Data 

In [2]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv', encoding= 'utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ### 
opened_file = open('AppleStore.csv', encoding="utf8")
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
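
The two cells above never close the files they open. As a minimal sketch, assuming a hypothetical helper named load_csv (it is not part of the original notebook), the same loading step could be wrapped in a with block so that the file is closed automatically and the header is returned separately from the data rows:

```python
from csv import reader


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    # newline="" is the setting the csv module documentation recommends
    # when passing a file object to csv.reader
    with open(path, encoding=encoding, newline="") as csv_file:
        rows = list(reader(csv_file))
    # The first row is the header; everything after it is data
    return rows[0], rows[1:]
```

With this helper, loading the Google Play data would read android_header, android = load_csv('googleplaystore.csv'), assuming the file sits next to the notebook.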

The explore_data() function prints a slice of a dataset one row at a time and, optionally, reports the total number of rows and columns.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False): #DATASET parameter should not include header!
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header)
explore_data(android,0,3,rows_and_columns = True)
print('\n')
print(ios_header)
explore_data(ios,0,3,rows_and_columns = True)



['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['281656475',

# Deleting Wrong Data

In [6]:
#Google Data Set Data Cleaning

#https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015 - example of bad data 
#Checks if any row length does not match length of header 

for row in android:
    if len(android_header) != len(row):
        print(android_header)
        print(len(android_header))
        print(row)
        print(len(row))
        print("Index position is:", android.index(row))
        


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
13
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12
Index position is: 10472


In [7]:
# Delete bad data from Google dataset
#Only run once

print(len(android))
del android[10472] #do not run again! Ran at 12:34pm 9/12/21
print(len(android))

10841
10840
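
Deleting by index has to be run exactly once, as the comments in the cell warn. A sketch of an alternative that is safe to re-run is to build a new list containing only the rows whose length matches the header. The helper name drop_malformed_rows and the toy data are assumptions made for illustration:

```python
def drop_malformed_rows(rows, header):
    """Keep only rows that have as many fields as the header.

    Hypothetical helper; running it twice gives the same result as running it once.
    """
    expected = len(header)
    return [row for row in rows if len(row) == expected]


# Toy data standing in for the real dataset
header = ["App", "Category", "Rating"]
rows = [
    ["A", "TOOLS", "4.1"],
    ["B", "1.9"],  # one field is missing, like the bad Google Play row
    ["C", "GAME", "4.5"],
]

clean = drop_malformed_rows(rows, header)
print(len(rows), len(clean))
```

Note that this only catches rows with the wrong number of fields; a row with the right length but shifted values would still pass.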


# Removing Duplicate Entries


## Part One

Some apps appear more than once in the Google Play dataset.

In [11]:
unique_apps = []
dup_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        dup_apps.append(name)
    else:
        unique_apps.append(name)

print('Instagram' in dup_apps)
print('Number of dup apps', len(dup_apps))
print('\n')
print('Examples of unique apps', unique_apps[0:7])

True
Number of dup apps 1181


Examples of unique apps ['Photo Editor & Candy Camera & Grid & ScrapBook', 'Coloring book moana', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'Sketch - Draw & Paint', 'Pixel Draw - Number Art Coloring Book', 'Paper flowers instructions', 'Smoke Effect Photo Maker - Smoke Editor']


In [9]:
#Check if app name Instagram  is duplicated in Google dataset

for app in android: 
    name = app[0]
    if name == 'Instagram':
        print(app)
            


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


To remove the duplicates, we'll keep only the row with the highest number of reviews for each app. The highest review count marks the most recent snapshot of the app, and the more reviews an app has, the more reliable its rating.

To do that, we will:
- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

## Part Two

In [13]:
#Only want to have unique app names in dictionary with highest (most recent) number of reviews 

reviews_max = {}

for row in android:
    name = row[0] # Name of app
    n_reviews = float(row[3]) # Number of reviews
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max)) #Should be 9659

#to check if code is correct, search for Instagram and see count of reviews. 66577446.0 is most recent number 
print(reviews_max['Instagram'])


9659
66577446.0


In Part One, we found 1181 duplicate entries. The length of our dictionary should therefore equal the length of our dataset minus 1181.

In [15]:
print("Expected Length", len(android) - 1181)
print("Length of reviews_max dictionary", len(reviews_max))

Expected Length 9659
Length of reviews_max dictionary 9659


The dictionary we created maps each unique app name to its highest number of reviews. The code below:
- Loops through each row in the Android dataset
- Adds the row to the new android_clean list if its review count equals the maximum stored in the dictionary for that app name. The already_added list makes sure an app is added only once when several of its rows share the same maximum.

In [17]:
#Remove duplicates from Google

android_clean = [] #Stores new clean data
already_added = [] #Just stores app names

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) # make sure this is inside the if block
        
print(android_clean[0:10])    

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000+', '

In [8]:
#Double check that android_clean has 9659 records
explore_data(android_clean, 0, 3, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
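
The two passes above (build reviews_max, then filter) can also be folded into one. The sketch below is an alternative, not the notebook's method: it keeps the best row seen so far for each name in a dictionary. The function name keep_most_reviewed and the toy rows are assumptions for illustration.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Return one row per app name: the row with the most reviews.

    Hypothetical helper. On a tie, the first row encountered is kept.
    """
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    # Dictionaries preserve insertion order (Python 3.7+), so apps come out
    # in the order in which their names first appeared
    return list(best.values())


# Toy rows: name, category, rating, reviews
toy = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Sketch", "ART_AND_DESIGN", "4.5", "215644"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Instagram", "SOCIAL", "4.5", "66577313"],
]
for row in keep_most_reviewed(toy):
    print(row)
```

One difference from the cell above: the output is ordered by each app's first appearance, whereas android_clean is ordered by the position of the row that holds the maximum.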


# Removing Non English Apps

## Part One

Some app names in our datasets contain characters that are not used in English text. The function below returns False if any character in the name falls outside the ASCII range, and True otherwise.
ASCII (American Standard Code for Information Interchange) covers the code points 0 through 127, so we use the ord() function to get each character's code point and compare it against 127.
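
As a quick illustration of the idea, the sketch below prints the code point of a few sample characters and whether each falls inside the ASCII range. The helper name is_ascii is an assumption made for this example; str.isascii() is a built-in string method available from Python 3.7.

```python
def is_ascii(character):
    """True if the character's code point is in the ASCII range 0..127."""
    return ord(character) <= 127


# Letters and basic punctuation are ASCII; symbols such as the trademark
# sign, CJK characters and emoji are not
for character in ["a", "Z", "~", "™", "爱", "😜"]:
    print(repr(character), ord(character), is_ascii(character))

# str.isascii() checks a whole string at once
print("Instagram".isascii())
print("Instachat 😜".isascii())
```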

In [9]:
#Write a function to check if a string contains non-English (non-ASCII) characters

def non_english_check(app_name):
    for character in app_name:
        if ord(character) > 127:
            return False
    return True #Only reached if every character is ASCII
     
    
non_english_check('Instagram')
non_english_check('爱奇艺PPS -《欢乐颂2》电视剧热播')
non_english_check('Instachat 😜') #Returns False because of the emoji, even though the app name is English


False

## Part Two
To minimize data loss, we'll only remove an app if its name has more than three non-ASCII characters. The function below revises the one above accordingly.

In [19]:

def non_english_check_new(app_name):
    count_outside_ascii = 0
    for character in app_name:
        if ord(character) > 127:
            count_outside_ascii += 1
            
    if count_outside_ascii > 3:
        return False
    else:
        return True
    

print(non_english_check_new('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(non_english_check_new('Instachat 😜'))
print(non_english_check_new('Docs To Go™ Free Office Suite'))

False
True
True
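
The cutoff of three is a judgment call, so it can be useful to make it a parameter. A more compact sketch of the same check, assuming a hypothetical function name is_probably_english and a hypothetical max_non_ascii parameter:

```python
def is_probably_english(app_name, max_non_ascii=3):
    """True if the name has at most max_non_ascii characters outside ASCII.

    Hypothetical variant of non_english_check_new with a configurable cutoff.
    """
    non_ascii = sum(1 for character in app_name if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_probably_english('Instachat 😜'))
print(is_probably_english('Docs To Go™ Free Office Suite'))
# A stricter cutoff rejects any name with a non-ASCII character
print(is_probably_english('Instachat 😜', max_non_ascii=0))
```

This is still a heuristic: a non-English name written entirely in ASCII characters passes, and an English name with four or more emoji is rejected.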


We now filter the non-English apps out of both datasets. Using the function above, we loop through each dataset and append every row whose app name is identified as English to a new list.

In [23]:
android_english_app = []
ios_english_app = []

for row in android_clean:
    app = row[0] # App column
    if non_english_check_new(app):
        android_english_app.append(row)
        

for row in ios:
    app = row[1] # track_name column
    if non_english_check_new(app):
        ios_english_app.append(row)
        
explore_data(android_english_app, 0, 3, True)
print('\n')
explore_data(ios_english_app, 0, 3, True)
        
        

    
    
    


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '

# Isolating the Free Apps

We will now isolate the free apps in both datasets into separate lists.


In [25]:
android_free_app = []
android_paid_app = []
ios_free_app = []
ios_paid_app = []

for row in android_english_app:
    name = row[0]
    price_type = row[7]
    if price_type == '0':
        android_free_app.append(row)
    else:
        android_paid_app.append(row)
        
print(len(android_free_app))

for row in ios_english_app:
    name = row[1]
    price_type = row[4]
    if price_type == '0':
        ios_free_app.append(row)
    else:
        ios_paid_app.append(row)

print(len(ios_free_app))
    
        


8864
3222
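
The comparison against the string '0' works for the rows shown above, but it would miss a free app whose price were written differently, for example as '0.0'. A more tolerant check could convert the price to a number first. This is a sketch under assumptions: the helper name is_free is hypothetical, and it assumes that paid prices may carry a leading dollar sign.

```python
def is_free(price):
    """True if a price string represents zero.

    Hypothetical helper. Values that cannot be parsed as a number are
    treated as not free rather than raising an error.
    """
    cleaned = price.strip().lstrip("$")
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False


for sample in ["0", "0.0", "$4.99", "3.99", "Everyone"]:
    print(repr(sample), is_free(sample))
```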


# Most Common Apps by Genre 
## Part One

As mentioned earlier, our goal is to determine the kinds of apps that are likely to attract more users, because revenue is highly influenced by the number of people using our apps.

The validation strategy for an app idea has three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we then develop it further.
- If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.

## Part Two

Two functions can be used to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function that we can use to display the percentages in a descending order
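
On a toy dataset, the two steps can be sketched with collections.Counter from the standard library. This is an alternative illustration rather than the notebook's implementation; the names percentage_table and print_sorted and the toy rows are assumptions.

```python
from collections import Counter


def percentage_table(rows, index):
    """Map each distinct value in a column to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


def print_sorted(table):
    """Print the table entries from the highest percentage to the lowest."""
    for value, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
        print(f"{value} : {pct:.2f}")


# Toy rows: app name, genre
toy = [["a", "Games"], ["b", "Games"], ["c", "Education"], ["d", "Games"]]
print_sorted(percentage_table(toy, 1))
```

Sorting on the percentage with a key function avoids having to build (percentage, name) tuples just to get the ordering.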

In [29]:
# freq table function

def freq_table(dataset, index):
    
    counts = {} # raw count of each distinct value in the column
    total = 0

    for row in dataset:
        total += 1
        value = row[index]
        if value in counts:
            counts[value] += 1
        else:
            counts[value] = 1
            
    table_percentages = {} # each count as a percentage of all rows
    for key in counts:
        percentage = (counts[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

#display table function

def display_table(dataset, index):
    table = freq_table(dataset, index) #remembers result of freq_table 
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
            
print('IOS Prime Genre')
display_table(ios_free_app, 11) # -5 index equivalent to index 11
print('\n')
print('Google Category')
display_table(android_free_app, 1 ) #Index 1 is Category
print('\n')
print('Google Genre')
display_table(android_free_app, 9 ) #Index 9 is Genres


IOS Prime Genre
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Google Category
FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSON

For iOS:
- The most common genre is Games (about 58% of the free English apps), followed by Entertainment (about 8%)
- The top five genres (Games, Entertainment, Photo & Video, Education, and Social Networking) are mostly geared toward fun. The sixth, Shopping, is the first with a mainly practical purpose


For Google Play:
- The most common genre is Tools, and the most common category is Family
- The top five genres and categories seem to be a mix of entertainment and utility apps
- Entertainment-oriented apps rank highly in both stores

Based on our comparisons, it would seem that the App Store is dominated by fun categories, while Google Play is more mixed.

# Most Popular Apps by Genre on the App Store

Below, we calculate the average number of user ratings per app genre on the App Store:

In [31]:
ios_genre = freq_table(ios_free_app, 11)

for genre in ios_genre:
    total = 0 # sum of user ratings per genre
    len_genre = 0 # number of apps specific to each genre
    for app in ios_free_app:
        genre_app = app[11]
        if genre_app == genre:
            user_ratings = float(app[5])
            total +=  user_ratings
            len_genre += 1
    average_number_usr = total / len_genre
    print(genre, ',', average_number_usr)
    

Productivity , 21028.410714285714
Weather , 52279.892857142855
Shopping , 26919.690476190477
Reference , 74942.11111111111
Finance , 31467.944444444445
Music , 57326.530303030304
Utilities , 18684.456790123455
Travel , 28243.8
Social Networking , 71548.34905660378
Sports , 23008.898550724636
Health & Fitness , 23298.015384615384
Games , 22788.6696905016
Food & Drink , 33333.92307692308
News , 21248.023255813954
Book , 39758.5
Photo & Video , 28441.54375
Entertainment , 14029.830708661417
Business , 7491.117647058823
Lifestyle , 16485.764705882353
Education , 7003.983050847458
Navigation , 86090.33333333333
Medical , 612.0
Catalogs , 4004.0


Navigation apps have the highest average number of user ratings (about 86,090), followed by Reference (about 74,942) and Social Networking (about 71,548).
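
The cell above scans the whole dataset once per genre. As a sketch of an alternative, the same averages can be collected in a single pass by accumulating a running total and a count for each group. The function name average_by_group and the toy rows are assumptions for illustration.

```python
from collections import defaultdict


def average_by_group(rows, group_idx, value_idx):
    """Average the numeric column value_idx for each distinct value of group_idx."""
    totals = defaultdict(float)  # group -> sum of values
    counts = defaultdict(int)    # group -> number of rows
    for row in rows:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows: app name, genre, number of ratings
toy = [
    ["x", "Games", "100"],
    ["y", "Games", "300"],
    ["z", "Medical", "50"],
]
print(average_by_group(toy, 1, 2))
```

For the App Store data, the equivalent call would be average_by_group(ios_free_app, 11, 5), using the same column indexes as the cell above.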

# Most Popular Apps by Genre on Google Play
The Google Play dataset includes an Installs column, but its values are open-ended buckets such as 10,000+ rather than exact counts. The loop below strips the commas and plus signs, converts each value to a float, and averages the installs per category.
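
The conversion step on its own can be sketched as a small function. The name parse_installs is an assumption for this example; the cleaning itself mirrors the two replace() calls used in the loop.

```python
def parse_installs(raw):
    """Convert an install bucket such as '10,000+' to a float.

    Hypothetical helper. The result is a lower bound, not an exact count:
    '10,000+' means at least 10,000 installs.
    """
    return float(raw.replace("+", "").replace(",", ""))


for sample in ["10,000+", "5,000,000+", "1,000,000,000+"]:
    print(sample, "->", parse_installs(sample))
```

Because every value is a lower bound, the averages computed from it understate the true install counts, but they are still usable for comparing categories with each other.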

In [33]:
unique_google_ag = freq_table(android_free_app, 1)

for category in unique_google_ag:
    total = 0 #store sum of installs per category
    len_category = 0 #store number of apps specific to each category
    for row in android_free_app:
        category_app = row[1]
        if category_app == category:
            n_installs = row[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total +=  float(n_installs)
            len_category += 1
            
    avg_installs  = (total / len_category)
    print(category, ',' , avg_installs)
            

            
    

ART_AND_DESIGN , 1986335.0877192982
AUTO_AND_VEHICLES , 647317.8170731707
BEAUTY , 513151.88679245283
BOOKS_AND_REFERENCE , 8767811.894736841
BUSINESS , 1712290.1474201474
COMICS , 817657.2727272727
COMMUNICATION , 38456119.167247385
DATING , 854028.8303030303
EDUCATION , 1833495.145631068
ENTERTAINMENT , 11640705.88235294
EVENTS , 253542.22222222222
FINANCE , 1387692.475609756
FOOD_AND_DRINK , 1924897.7363636363
HEALTH_AND_FITNESS , 4188821.9853479853
HOUSE_AND_HOME , 1331540.5616438356
LIBRARIES_AND_DEMO , 638503.734939759
LIFESTYLE , 1437816.2687861272
GAME , 15588015.603248259
FAMILY , 3695641.8198090694
MEDICAL , 120550.61980830671
SOCIAL , 23253652.127118643
SHOPPING , 7036877.311557789
PHOTOGRAPHY , 17840110.40229885
SPORTS , 3638640.1428571427
TRAVEL_AND_LOCAL , 13984077.710144928
TOOLS , 10801391.298666667
PERSONALIZATION , 5201482.6122448975
PRODUCTIVITY , 16787331.344927534
PARENTING , 542603.6206896552
WEATHER , 5074486.197183099
VIDEO_PLAYERS , 24727872.452830188
NEWS_AND_

Among the categories shown, COMMUNICATION apps have the highest average number of installs (about 38.5 million).

# Conclusion

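The goal of this project was to analyze apps in both the App Store and Google Play to find the types of apps that attract the most users. Among free English apps, Navigation apps had the highest average number of user ratings on the App Store, and Communication apps had the highest average number of installs among the Google Play categories shown.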