# Analysis of free apps in the Google Play store and App Store

* This project analyzes data on free Android and Apple apps.
* Goal of the project: to help developers understand which types of apps are likely to attract more users and which will generate the most advertising revenue. The data analysis for this project will be used to create a strategy for developing our own app.

Note: This is a guided project from Step 1 of the Data Analyst in Python course from www.dataquest.io

In [1]:
from csv import reader

# Read every row of each CSV file into a list of lists, then close the file.
openapplestorefile = open('AppleStore.csv', encoding='utf8')
apple_read_file = reader(openapplestorefile)
apple_apps_data = list(apple_read_file)
openapplestorefile.close()

opengooglestorefile = open('googleplaystore.csv', encoding='utf8')
google_read_file = reader(opengooglestorefile)
google_apps_data = list(google_read_file)
opengooglestorefile.close()

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(apple_apps_data, 0, 4, rows_and_columns=True)
print('\n')
explore_data(google_apps_data, 0, 4, rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 

# Deleting inaccurate data


There is 1 row in the Google Play Store dataset that is missing information. The following code deletes that row.

In [2]:
print(len(google_apps_data))
del google_apps_data[10473]
print(len(google_apps_data))

10842
10841
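The index 10473 used above is hard-coded. One way such a row could be located is to compare each row's length with the header's length. The sketch below illustrates the idea on a made-up dataset; `find_malformed_rows` and the sample rows are hypothetical and are not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    header_length = len(dataset[0])
    return [
        (index, row)
        for index, row in enumerate(dataset)
        if len(row) != header_length
    ]


# Hypothetical sample data: the third row is missing its 'Category' value.
sample_data = [
    ['App', 'Category', 'Rating'],
    ['Example App A', 'TOOLS', '4.1'],
    ['Example App B', '3.9'],
]

malformed = find_malformed_rows(sample_data)
for index, row in malformed:
    print('Row', index, 'has', len(row), 'columns:', row)
```

Because `del` changes the list in place, running the deletion cell a second time would remove another, valid row. Checking the row's length before deleting is one way to guard against that.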


The following code checks to see if the App Store dataset has any rows whose length deviates from the header row. The code does not return any rows, so we know that there is no missing data.

In [3]:
apple_header = apple_apps_data[0]

for row in apple_apps_data:
    if len(row) != len(apple_header):
        print(row)

# Deleting duplicate data

The Google Play Store dataset contains duplicate entries: some apps appear in more than one row. The code below counts them and prints a sample of the duplicated app names.

In [4]:
duplicate_apps = []
unique_apps = []

for row in google_apps_data[1:]:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Number of unique apps: ', len(unique_apps))
print('\n')
print('Sample of duplicate data: ', duplicate_apps[0:4])
print('\n')
print('Expected length of dataset with duplicates removed: ', (len(google_apps_data[1:]) - len(duplicate_apps)))

Number of duplicate apps:  1181


Number of unique apps:  9659


Sample of duplicate data:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings']


Expected length of dataset with duplicates removed:  9659
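The counting above can also be done with `collections.Counter` from the standard library, which avoids repeated `in` checks against a growing list. This is a minimal sketch on hypothetical rows (the review counts are invented), assuming the app name sits in column 0 as it does in the Google Play data.

```python
from collections import Counter

# Hypothetical sample rows: app name in column 0.
sample_rows = [
    ['Box', 'BUSINESS', '159872'],
    ['Box', 'BUSINESS', '159872'],
    ['Example Notes', 'PRODUCTIVITY', '120'],
    ['Box', 'BUSINESS', '159880'],
]

name_counts = Counter(row[0] for row in sample_rows)

# Every occurrence after the first counts as a duplicate.
n_unique = len(name_counts)
n_duplicates = sum(count - 1 for count in name_counts.values())

print('Unique apps:', n_unique)
print('Duplicate rows:', n_duplicates)
print('Rows remaining after de-duplication:', len(sample_rows) - n_duplicates)
```

The number of rows remaining after de-duplication always equals the number of unique app names, which is a quick sanity check on the expected length.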


Duplicates will not be removed at random. For each app, we keep the row with the highest number of user reviews and delete the others. A higher review count suggests that the row was collected more recently, so this keeps the most up-to-date data in our dataset.

The code below creates a dictionary of the highest amounts of user reviews for each unique app in the dataset.

In [5]:
reviews_max = {}

for row in google_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
    
print('The length of reviews_max dictionary is: ', len(reviews_max))

The length of reviews_max dictionary is:  9659


The code below finds, for each app, the row with the highest number of reviews and appends the entire row to the android_clean list, producing a list of lists. The name of each app is then added to already_added, so that a second row with the same name and the same review count is not added again. This eliminates duplicate data from our dataset.

In [6]:
android_clean = []
already_added = []

for row in google_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

print('Sample of android_clean list: ', android_clean[0:4])
print('\n')
print('Sample of already_added list: ', already_added[0:5])
print('\n')
print('The length of android_clean list is: ', len(android_clean))

Sample of android_clean list:  [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']]


Sample of already_added list:  ['Photo Editor & Candy Camera & Grid & ScrapBook', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'Sketch - Draw & Paint', 'Pixel Draw - Number Art Coloring Book', 'Paper flowers instruct
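The two steps above (building reviews_max, then filtering with already_added) could also be combined into a single pass that stores the best row per app in a dictionary. The sketch below is one possible variant, not the notebook's method; `keep_most_reviewed` and the sample rows are hypothetical.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    # Dictionaries preserve insertion order (Python 3.7+), so each app stays
    # at the position where its name first appeared.
    return list(best_rows.values())


# Hypothetical sample rows: name in column 0, review count in column 3.
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Example Notes', 'PRODUCTIVITY', '4.0', '120'],
    ['Box', 'BUSINESS', '4.2', '159880'],
]

deduplicated = keep_most_reviewed(sample_rows)
for row in deduplicated:
    print(row)
```

One difference to be aware of: this variant orders apps by the first appearance of each name, whereas the loop above orders them by the position of the kept row, so the two results can contain the same rows in a different order.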

For the purpose of these datasets, we are only interested in apps whose names are written in English. In the code below, we loop over the characters in a string and use ord() to get each character's code point. Characters commonly used in English text fall in the ASCII range (0 to 127), so a code point above 127 marks a character as non-English.

In [7]:
def special_characters(string):
    number_special_characters = 0
    for character in string:
        if ord(character) > 127:
            number_special_characters += 1
    
    if number_special_characters > 3:
        return False
    else:
        return True
      
print('Is Instagram in English?: ', special_characters('Instagram'))
print('Is 爱奇艺PPS -《欢乐颂2》电视剧热播 in English?: ', special_characters('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print('Is Docs To Go™ Free Office Suite in English?: ', special_characters('Docs To Go™ Free Office Suite'))
#print('Is Instachat 😜 in English?: ', special_characters('Instachat 😜'))


Is Instagram in English?:  True
Is 爱奇艺PPS -《欢乐颂2》电视剧热播 in English?:  False
Is Docs To Go™ Free Office Suite in English?:  True
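To see why the threshold of 3 matters, it can help to count the non-ASCII characters in each name directly. The helper below is a hypothetical illustration (`count_non_ascii` is not part of the notebook); it uses the same `ord(character) > 127` test as `special_characters`.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for character in text if ord(character) > 127)


names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in names:
    print(name, '->', count_non_ascii(name))
```

Names with a trademark sign or a single emoji stay at or below the threshold and are kept, while a name written mostly in Chinese characters exceeds it.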


For the purpose of this dataset, we only remove apps whose names contain more than 3 non-ASCII characters. This tolerates English names that include a few symbols, such as ™ or an emoji. In the code below, we identify which app names exceed that limit and separate the data into 2 lists: English and non-English. This is performed for both the Apple and Android datasets.

In [8]:
english_android_clean = []
non_english_android_clean = []

for row in android_clean:
    name = row[0]
    if not special_characters(name):
        non_english_android_clean.append(row)
    else:
        english_android_clean.append(row)

english_apple = []
non_english_apple = []

for row in apple_apps_data[1:]:
    name = row[1]
    if not special_characters(name):
        non_english_apple.append(row)
    else:
        english_apple.append(row)
      
print('Length of English Android apps list: ', len(english_android_clean))
print('Length of English Apple apps list: ',len(english_apple))
print('Length of non-English Android apps list: ', len(non_english_android_clean))
print('Length of non-English Apple apps list: ',len(non_english_apple))


Length of English Android apps list:  9614
Length of English Apple apps list:  6183
Length of non-English Android apps list:  45
Length of non-English Apple apps list:  1014


We are interested in identifying which apps are free and which are paid. The code below separates the apps that are free from each dataset. We now have our final lists of apps whose data we will analyze. 

In [9]:
apple_apps_final = []
android_apps_final = []

for row in english_apple:
    price = float(row[4])
    if price == 0.0:
        apple_apps_final.append(row)

for row in english_android_clean:
    price = row[7]
    if price == '0':
        android_apps_final.append(row)
        
print('Length of Free Apple apps list: ', len(apple_apps_final))
print('Length of Free Android apps list: ', len(android_apps_final))
print('Length of paid Apple apps: ', len(english_apple) - len(apple_apps_final))
print('Length of paid Android apps: ', len(english_android_clean) - len(android_apps_final))

print('Sample of final Apple data: ', apple_apps_final[0:3])
print('Sample of final Android data: ', android_apps_final[0:4])


Length of Free Apple apps list:  3222
Length of Free Android apps list:  8864
Length of paid Apple apps:  2961
Length of paid Android apps:  750
Sample of final Apple data:  [['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']]
Sample of final Android data:  [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '
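The two loops above test for free apps differently: the Apple price is converted to a float, while the Android price is compared with the string '0'. A single helper could handle both formats. The sketch below assumes that paid Android prices are written with a leading dollar sign (for example '$4.99'); that format is an assumption, and `parse_price` and `is_free` are hypothetical names.

```python
def parse_price(price_text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(price_text.replace('$', '').strip())


def is_free(price_text):
    """Return True when the parsed price is exactly zero."""
    return parse_price(price_text) == 0.0


# Hypothetical price values covering both formats.
for price_text in ['0', '0.0', '$4.99']:
    print(repr(price_text), '->', is_free(price_text))
```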

# App development strategy

* Now that we are done with data cleaning, we move on to analyzing the data from each dataset. We are analyzing the app data to create a strategy to use while creating our own app. Our strategy will be based on what the data shows for the apps in each dataset: we are looking for what works and what does not. We want to develop our app based on apps that are successful in both the Android and iOS markets.
* The goals for our app development are:
    1. Build a minimal Android version of our app and add it to the Google Play Store.
    2. If the app has a good response from users, we will develop the app further.
    3. If the app is profitable after 6 months, we will build an iOS version of the app and add it to the App Store.

In the code below, we generate frequency tables to determine the number of apps associated with each genre in the Android and Apple app stores. 

Based on the frequency tables that are generated, the most common genre for Apple apps is Games and the most common genre for Android apps is Tools.

In [10]:
apple_genres = {}
android_genres = {}

for row in apple_apps_final:
    genre = row[11]
    if genre in apple_genres:
        apple_genres[genre] += 1
    else:
        apple_genres[genre] = 1
        
for row in android_apps_final:
    genre = row[9]
    if genre in android_genres:
        android_genres[genre] += 1
    else:
        android_genres[genre] = 1
      
print(apple_genres)
print(android_genres)

{'Social Networking': 106, 'Photo & Video': 160, 'Games': 1874, 'Music': 66, 'Reference': 18, 'Health & Fitness': 65, 'Weather': 28, 'Utilities': 81, 'Travel': 40, 'Shopping': 84, 'News': 43, 'Navigation': 6, 'Lifestyle': 51, 'Entertainment': 254, 'Food & Drink': 26, 'Sports': 69, 'Book': 14, 'Finance': 36, 'Education': 118, 'Productivity': 56, 'Business': 17, 'Catalogs': 4, 'Medical': 6}
{'Art & Design': 53, 'Art & Design;Creativity': 6, 'Auto & Vehicles': 82, 'Beauty': 53, 'Books & Reference': 190, 'Business': 407, 'Comics': 54, 'Comics;Creativity': 1, 'Communication': 287, 'Dating': 165, 'Education': 474, 'Education;Creativity': 4, 'Education;Education': 30, 'Education;Pretend Play': 5, 'Education;Brain Games': 3, 'Entertainment': 538, 'Entertainment;Brain Games': 7, 'Entertainment;Creativity': 3, 'Entertainment;Music & Video': 15, 'Events': 63, 'Finance': 328, 'Food & Drink': 110, 'Health & Fitness': 273, 'House & Home': 73, 'Libraries & Demo': 83, 'Lifestyle': 345, 'Lifestyle;Pret

In the code below, we build functions that generate frequency tables to show the percentages of the genres in each dataset.

In [11]:
def freq_table(dataset, index):
    frequency_table = {}
    for row in dataset:
        genre = row[index]
        if genre in frequency_table:
            frequency_table[genre] += 1
        else:
            frequency_table[genre] = 1
    
    frequency_percent = {}
    for x in frequency_table:
        percentage = frequency_table[x] / len(dataset) * 100
        frequency_percent[x] = percentage
    return frequency_percent

# The function below was written by Dataquest.io and added to this project per their instruction.
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
print('Apple prime_genre column data: ')        
display_table(apple_apps_final, 11)
print('\n')
print('Android Genres column data: ')
display_table(android_apps_final, 9)
print('\n')
print('Android Category column data: ')
display_table(android_apps_final, 1)


Apple prime_genre column data: 
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Android Genres column data: 
Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.5311371841
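Each percentage above is a genre's count divided by the number of apps, times 100. For example, Games on the App Store is 1874 / 3222 * 100, or about 58.16. The same table could be built more compactly with `collections.Counter`. The sketch below is an alternative under that assumption, run on a hypothetical four-row dataset; `freq_table_percent` and `sorted_table` are illustrative names.

```python
from collections import Counter


def freq_table_percent(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


def sorted_table(dataset, index):
    """Return (value, percent) pairs sorted from most to least common."""
    table = freq_table_percent(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample data: genre in column 1.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Music']]

for genre, percent in sorted_table(sample, 1):
    print(f'{genre} : {percent:.2f}')
```

Formatting with `:.2f` rounds the printed percentages to two decimals, which makes the table easier to compare with the figures quoted in the analysis.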

## Frequency table analysis
### Prime_genre column (Apple)
* The most common genre is Games (58.16%). The second most common genre is Entertainment (7.88%).
* The rest of the genres each make up a much smaller share of the total: all but 2 of the genres are below 5% of the total.
* The majority of apps in the dataset are in an entertainment-related genre (Games, Entertainment, etc.).
* Recommendation: Based on the Apple apps dataset that we analyzed, it would be best for the developers of our app to create an app in one of the top genres: Games, Entertainment or Photo & Video. Because these genres make up the majority of apps in the dataset, this suggests that users have the most interest in these genres.

### Genres column (Android)
* The most common genre is Tools (8.45%). The second most common genre is Entertainment (6.07%).
* A pattern in this dataset is that there is not 1 genre that takes up a much larger percentage of the frequency table, like in the Apple app dataset. The app genres here are more evenly distributed. There are also many more genres and sub-genres in this dataset versus in the Apple app dataset.
* Recommendation: The most common Android app genres are for productivity (Tools, Education, Business, Productivity), so it would be a good idea for our app developers to make an app in one of these genres.

### Category column (Android)
* The most common category is FAMILY (18.90%). The second most common category is GAME (9.72%).
* A pattern in this column is that there are fewer categories than there are genres in the Android Genres column. There are also no sub-categories like in the Genres column. There is also 1 category that is a larger percentage of the dataset than any other category, similar to the Apple app dataset.
* Recommendation: The most common Android categories are Family, Game, and Tools, so it would be a good idea for our app developers to make an app in one of these categories.

In the code below, we create frequency tables to find out which genres are the most popular. We determine this based on the number of installs for Android apps and the number of reviews for Apple apps (although the number of reviews is not the most accurate way to determine the number of downloads, the dataset does not provide the number of downloads for each app).

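App Store apps recommendation: Based on the output of the code below, where the average number of user ratings per app stands in for downloads, the genre with the highest average is Navigation (about 86,090), followed by Reference (about 74,942) and Social Networking (about 71,548). Navigation has only 6 apps in the dataset, so its average rests on very few apps. I recommend that our app developers create an app in the Reference genre because the barrier to entry is low. The types of apps in this genre would be easy to develop and have a good value proposition for users compared to apps in other genres.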

In [12]:
apple_freq_table = freq_table(apple_apps_final, 11)

for genre in apple_freq_table:
    total = 0
    len_genre = 0
    
    for app in apple_apps_final:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
        
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)
        

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
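The nested loop above rescans the whole dataset once per genre. The same averages could be computed in a single pass by accumulating a running total and a count for each genre. This is a sketch of that alternative on hypothetical rows; `average_by_group` is an illustrative name, not a function from the notebook.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Average the numeric column `value_index` within each group."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical sample rows: genre in column 1, rating count in column 2.
sample = [
    ['App A', 'Navigation', '100'],
    ['App B', 'Navigation', '300'],
    ['App C', 'Reference', '50'],
]

averages = average_by_group(sample, 1, 2)
for genre, average in averages.items():
    print(genre, ':', average)
```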


Google Play store app recommendation: In the output of the code below, the categories with the highest average number of installs include COMMUNICATION (about 38.5 million), VIDEO_PLAYERS (about 24.7 million), and SOCIAL (about 23.3 million). My recommendation for our app developers is to avoid the Social category and develop an app in a different one, possibly Communication or Productivity. These categories have a lower barrier to entry for an app developer. Social apps would take longer to become popular because they depend on having a user base, which takes time to develop.

In [13]:
category_freq_table = freq_table(android_apps_final, 1)

for category in category_freq_table:
    total = 0
    len_category = 0
    
    for app in android_apps_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '').replace(',', '')
            total += float(n_installs)
            len_category += 1
            
    avg_installs = total / len_category
    print(category, ':', avg_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_
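The output above is printed in dictionary order, which makes the largest categories hard to spot. Sorting the averages in descending order is one way to make the ranking explicit. The sketch below reuses four averages copied from the output above; `parse_installs` is a hypothetical helper that mirrors the `replace` calls in the loop.

```python
def parse_installs(installs_text):
    """Convert an installs string such as '10,000+' to an integer."""
    return int(installs_text.replace('+', '').replace(',', ''))


print(parse_installs('10,000+'))

# Four category averages copied from the output above.
averages = {
    'COMMUNICATION': 38456119.167247385,
    'SOCIAL': 23253652.127118643,
    'VIDEO_PLAYERS': 24727872.452830188,
    'MEDICAL': 120550.61980830671,
}

# Sort categories from highest to lowest average installs.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for category, average in ranked:
    print(f'{category} : {average:,.0f}')
```

Note that install counts such as '10,000+' are lower bounds, so these averages are approximations rather than exact download figures.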

In the code below, we analyze the average number of reviews for each rating. We do this by creating a frequency table of the ratings and calculating the average number of reviews for the apps with each rating.

App Store app recomm

In [14]:
# apple_ratings_table = freq_table(apple_apps_final, 7)

# # for genre in apple_ratings_table:
# #     total = 0
# #     len_ratings = 0
    
# #     for app in apple_apps_final:
# #         app_genre = app[-5]
# #         if app_genre == genre:
# #             app_rating = float(app[5])
# #             total += app_rating
# #             len_ratings += 1
        
# #     avg_n_ratings = total / len_ratings
# #     print(genre, ':', avg_n_ratings)

# print(apple_ratings_table)


In [17]:
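def freq_table_avg(dataset, index, count_index=5):
    # Assumed intent, based on the description above: for each distinct value
    # in column `index` (the user rating), average the values in column
    # `count_index` (rating_count_tot, the number of reviews).
    # The earlier version divided the rating string itself by the count,
    # which raised: TypeError: unsupported operand type(s) for /: 'str' and 'int'
    frequency_table = {}
    totals = {}
    for row in dataset:
        rating = row[index]
        n_reviews = float(row[count_index])
        if rating in frequency_table:
            frequency_table[rating] += 1
            totals[rating] += n_reviews
        else:
            frequency_table[rating] = 1
            totals[rating] = n_reviews
    
    frequency_avg = {}
    for x in frequency_table:
        average = totals[x] / frequency_table[x]
        frequency_avg[x] = average
    return frequency_avg

apple_ratings_table = freq_table_avg(apple_apps_final, 7)
print(apple_ratings_table)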