# App store data analysis

**This project analyzes user engagement with apps that are free to download. Revenue is generated by in-app ads only. The goal of this analysis is to understand which types of apps generate the most revenue.**

In [1]:
from csv import reader

# Read both datasets into lists of rows; row 0 of each list is the header.
apple_opened_file = open('AppleStore.csv', encoding='utf8')
google_opened_file = open('googleplaystore.csv', encoding='utf8')
apple_apps_data = list(reader(apple_opened_file))
google_apps_data = list(reader(google_opened_file))
apple_opened_file.close()
google_opened_file.close()
index = 0

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(apple_apps_data,1,4,True)        
explore_data(google_apps_data,1,4,True)

explore_data(apple_apps_data,0,1)  
explore_data(google_apps_data,0,1)

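# Find the row whose free/paid column (index 6) is 'NaN' and remember its index.
found_index = None
for index, row in enumerate(google_apps_data[1:], start=1):
    if row[6] == 'NaN':
        print(row)
        print('Index: ' + str(index))
        found_index = index

# Delete the row only if one was found.
if found_index is not None:
    del google_apps_data[found_index]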
        


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2
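
The loading step in the first cell can also be written with a context manager, so the file is closed automatically and the header is separated from the data rows. This is a minimal sketch: `load_csv` is a hypothetical helper that is not part of the notebook, and the demo writes a tiny temporary file with made-up values so that it runs without the real datasets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file; the file is closed on exit."""
    with open(path, encoding=encoding, newline='') as f:
        data = list(csv.reader(f))
    return data[0], data[1:]


# Demo with a tiny temporary file (sample values, not the real dataset).
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write('App,Category,Rating\n')
    tmp.write('Instagram,SOCIAL,4.5\n')
    tmp_path = tmp.name

header, rows = load_csv(tmp_path)
os.remove(tmp_path)
print(header)
print(rows)
```

Keeping the header separate means later loops would not need the `[1:]` slice.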

**We have to remove duplicate data before we can proceed with the analysis.**

For example, let's check the Instagram app:

In [2]:
for row in google_apps_data[1:]:
    if row[0] == 'Instagram':
        print(row)       

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [3]:
duplicated_entries = []
unique_entries = []

for row in google_apps_data[1:]:
    if row[0] in unique_entries:
        duplicated_entries.append(row[0])
    else:
        unique_entries.append(row[0])
        
print('Number of duplicated entries: ', len(duplicated_entries))

Number of duplicated entries:  1181
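
Membership tests on a list get slower as the list grows. As a sketch, the same kind of count can be obtained with a set, which has fast lookups. `count_duplicates` is a hypothetical helper and the short list of titles is only for illustration.

```python
def count_duplicates(names):
    """Count entries whose name has already been seen earlier in the list."""
    seen = set()
    duplicates = 0
    for name in names:
        if name in seen:
            duplicates += 1
        else:
            seen.add(name)
    return duplicates


sample_names = ['Instagram', 'Instagram', 'Facebook', 'Instagram']
print('Number of duplicated entries:', count_duplicates(sample_names))
```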


**Duplicate entries will not be removed at random. For each duplicated app we keep the row with the highest number of reviews, because a higher review count indicates that the data was collected more recently.**

**Now we will create a dictionary that maps each app title to its highest review count.**

In [4]:
reviews_max = {}

# Delete a malformed row whose review count (index 3) is not a number
del google_apps_data[10472]

for row in google_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    # Store the count for a new name, or replace a smaller stored count
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
print('\n', len(reviews_max), '\n')


 9658 
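
The hard-coded index in the cell above has to be found somehow. One way to do that is to scan for malformed rows, sketched here under the assumption that such a row either has a different number of columns than the header or a review count that cannot be converted to a number. `find_bad_rows` and the sample data are hypothetical.

```python
def find_bad_rows(dataset, reviews_index=3):
    """Return indexes of rows that do not match the header length
    or whose review count cannot be converted to a number."""
    header_len = len(dataset[0])
    bad = []
    for i, row in enumerate(dataset[1:], start=1):
        if len(row) != header_len:
            bad.append(i)
            continue
        try:
            float(row[reviews_index])
        except ValueError:
            bad.append(i)
    return bad


sample = [
    ['App', 'Category', 'Rating', 'Reviews'],
    ['Good app', 'TOOLS', '4.1', '159'],
    ['Shifted row', '1.9', '19', '3.0M'],   # category missing, columns shifted
    ['Short row', 'TOOLS', '4.0'],          # one column too few
]
print(find_bad_rows(sample))
```

When deleting several rows by index, delete from the highest index down so that the remaining indexes do not shift.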



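**Here we will remove the duplicated rows by using two lists. One stores the cleaned dataset, the other stores the app titles that have already been added, so that only one row is kept when several rows of an app share the same maximum review count.**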

In [5]:
android_clean = []
already_added = []

for row in google_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

# Test print
for row in android_clean:
    if row[0] == 'Instagram':
        print(row)
        
print('Number of unique apps:\n', len(android_clean))

['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
Number of unique apps:
 9658
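
The two passes above can also be combined into one. The following is a sketch and not the notebook's method: it keeps, for each title, the first row with the highest review count in a dictionary. The column positions (name at index 0, reviews at index 3) follow the notebook, `dedupe_by_reviews` is a hypothetical name, and the sample rows are shortened to four columns.

```python
def dedupe_by_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
for row in dedupe_by_reviews(sample_rows):
    print(row)
```

The result is ordered by the first appearance of each title, so the row order can differ from the two-list version even though the same rows are kept.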


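**This step detects app titles that contain non-English characters by checking the code point of each character, which for standard English text does not exceed 127 (the ASCII range). Up to three characters outside this range are allowed, so that titles with an emoji or a symbol such as ™ are not dropped. Apps that pass the check are collected in a new list; all others are filtered out.**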

In [6]:
def check_apptitle(title):
    counter = 0
    for char in title:
        if ord(char) > 127:
            counter += 1
            if counter > 3:
                return False
    return True    
     
print ('Instagram ', check_apptitle('Instagram'))
print ('爱奇艺PPS -《欢乐颂2》电视剧热播 ', check_apptitle('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print ('Docs To Go™ Free Office Suite ', check_apptitle('Docs To Go™ Free Office Suite'))
print ('Instachat 😜 ', check_apptitle('Instachat 😜'))

english_apps = []

for row in android_clean:
    if check_apptitle(row[0]):
        english_apps.append(row)
        
print('Number of English apps:\n', len(english_apps))

Instagram  True
爱奇艺PPS -《欢乐颂2》电视剧热播  False
Docs To Go™ Free Office Suite  True
Instachat 😜  True
Number of English apps:
 9613
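
The limit of three characters is a heuristic. A sketch with the limit as a parameter makes it easy to see how the choice changes the result. `is_probably_english` and `max_non_ascii` are hypothetical names, not part of the notebook.

```python
def is_probably_english(title, max_non_ascii=3):
    """Return True if the title has at most `max_non_ascii` characters
    outside the ASCII range (code points 0 to 127)."""
    non_ascii = sum(1 for char in title if ord(char) > 127)
    return non_ascii <= max_non_ascii


titles = ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']
for title in titles:
    # Compare a strict rule (no non-ASCII allowed) with the default limit of 3.
    print(title, is_probably_english(title, 0), is_probably_english(title))
```

With a limit of 0, the two titles containing ™ or an emoji would be rejected, which is why the notebook tolerates a few such characters.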


**We will isolate the free apps in a new dataset.**

In [7]:
free_apps = []
for row in english_apps:
    if row[6] == 'Free':
        free_apps.append(row)

print('Number of free apps:\n', len(free_apps))    
    

Number of free apps:
 8863


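**The validation strategy is to first build a minimal Android version of an app. If it gets a good response from users and is profitable after six months, an iOS version can be built and added to the App Store. We therefore look for app types that are common on both stores, starting with frequency tables for the category and genre columns.**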

In [8]:
def freq_table (dataset, index):
    freq_tb = {}
    for row in dataset:
        if row[index] in freq_tb:
            freq_tb[row[index]] += 1
        else:
            freq_tb[row[index]] = 1
    return freq_tb

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
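# Note: the tables are built from the raw lists, not from free_apps.
# The header row at index 0 is included, so the column name shows up
# as an entry with a count of 1 (e.g. 'Category : 1').
# Pass dataset[1:] to skip it.
print ('\n Playstore Category:')
display_table(google_apps_data,1)
print ('\n Playstore Genres:')
display_table(google_apps_data,9)
print ('\n App Store prime_genre:')
display_table(apple_apps_data,11)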
       
       


 Playstore Category:
FAMILY : 1971
GAME : 1144
TOOLS : 843
MEDICAL : 463
BUSINESS : 460
PRODUCTIVITY : 424
PERSONALIZATION : 392
COMMUNICATION : 387
SPORTS : 384
LIFESTYLE : 382
FINANCE : 366
HEALTH_AND_FITNESS : 341
PHOTOGRAPHY : 335
SOCIAL : 295
NEWS_AND_MAGAZINES : 283
SHOPPING : 260
TRAVEL_AND_LOCAL : 258
DATING : 234
BOOKS_AND_REFERENCE : 231
VIDEO_PLAYERS : 175
EDUCATION : 156
ENTERTAINMENT : 149
MAPS_AND_NAVIGATION : 137
FOOD_AND_DRINK : 127
HOUSE_AND_HOME : 88
LIBRARIES_AND_DEMO : 85
AUTO_AND_VEHICLES : 85
WEATHER : 82
ART_AND_DESIGN : 65
EVENTS : 64
PARENTING : 60
COMICS : 60
BEAUTY : 53
Category : 1

 Playstore Genres:
Tools : 842
Entertainment : 623
Education : 549
Medical : 463
Business : 460
Productivity : 424
Sports : 398
Personalization : 392
Communication : 387
Lifestyle : 381
Finance : 366
Action : 365
Health & Fitness : 341
Photography : 335
Social : 295
News & Magazines : 283
Shopping : 260
Travel & Local : 257
Dating : 234
Books & Reference : 231
Arcade : 220
Simul
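
Raw counts are hard to compare between the two stores because the datasets differ in size. A sketch that expresses a frequency table as percentages, using `collections.Counter`, could look like this. `freq_table_percent` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter


def freq_table_percent(dataset, index):
    """Return {value: percentage of rows} for one column, most common first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: round(count / total * 100, 2)
            for value, count in counts.most_common()}


sample = [
    ['App A', 'GAME'],
    ['App B', 'GAME'],
    ['App C', 'TOOLS'],
    ['App D', 'FAMILY'],
]
for value, share in freq_table_percent(sample, 1).items():
    print(value, ':', share)
```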

**In this step we estimate the popularity of each App Store genre. The App Store dataset used here has no install counts, so we use the total number of ratings per app (column index 5) as a proxy for installs and average it per genre.**

In [30]:
freq_prime_genre = {}
installs_per_genre = {}
freq_prime_genre = freq_table (apple_apps_data[1:],11)

for genre in freq_prime_genre:
    total = 0
    len_genre = 0
    for row in apple_apps_data[1:]:
        genre_app = row[11]
        if genre_app == genre:
            total += float(row[5])
            len_genre += 1
    avg_rating = total / len_genre
    installs_per_genre[genre] = int(avg_rating)
    print ('Average number of ratings for',genre,'is', int(avg_rating))
    
print ('\nRecommended genre (highest average number of ratings):', max(installs_per_genre, key = installs_per_genre.get))
        

Average number of ratings for Music is 28842
Average number of ratings for Photo & Video is 14352
Average number of ratings for Reference is 22410
Average number of ratings for Entertainment is 7533
Average number of ratings for Social Networking is 45498
Average number of ratings for News is 13015
Average number of ratings for Travel is 14129
Average number of ratings for Sports is 14026
Average number of ratings for Productivity is 8051
Average number of ratings for Education is 2239
Average number of ratings for Finance is 11047
Average number of ratings for Business is 4788
Average number of ratings for Health & Fitness is 9913
Average number of ratings for Games is 13691
Average number of ratings for Food & Drink is 13938
Average number of ratings for Utilities is 6863
Average number of ratings for Book is 5125
Average number of ratings for Navigation is 11853
Average number of ratings for Weather is 22181
Average number of ratings for Catalogs is 1732
Average number of ratings fo
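
The nested loop above rescans the whole dataset once per genre. As a sketch, the same kind of average can be accumulated in a single pass with two dictionaries. `average_by_group` is a hypothetical helper, and the sample rows use made-up values with only two columns.

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the value column}, computed in a single pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Columns: genre, total rating count (illustrative values).
sample = [
    ['Games', '100'],
    ['Games', '300'],
    ['Music', '50'],
]
averages = average_by_group(sample, 0, 1)
print(averages)
print(max(averages, key=averages.get))
```

For the App Store data the call would be `average_by_group(apple_apps_data[1:], 11, 5)`, assuming the column positions used in the cell above.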

**Now we use the average number of installs per app in the Play Store as a measure of the popularity of each category.**

In [35]:
freq_category = {}
installs_per_category = {}
freq_category = freq_table(google_apps_data[1:],1)

for category in freq_category:
    total = 0
    len_category = 0
    for row in google_apps_data[1:]:
        genre_app = row[1]
        if genre_app == category:
            installs = row[5].replace(',','')
            installs = installs.replace('+','')
            total += float(installs)
            len_category += 1
    avg_installs = total / len_category
    installs_per_category[category] = int(avg_installs)
    print ('Average number of installs for',category,'is', int(avg_installs))
    
print ('\nRecommended category (highest average number of installs):', max(installs_per_category, key = installs_per_category.get))

Average number of installs for MAPS_AND_NAVIGATION is 5286729
Average number of installs for EVENTS is 249580
Average number of installs for COMICS is 934769
Average number of installs for GAME is 30669601
Average number of installs for HEALTH_AND_FITNESS is 4642441
Average number of installs for LIFESTYLE is 1407443
Average number of installs for SHOPPING is 12491726
Average number of installs for LIBRARIES_AND_DEMO is 741128
Average number of installs for ENTERTAINMENT is 19256107
Average number of installs for PARENTING is 525351
Average number of installs for FAMILY is 5204598
Average number of installs for NEWS_AND_MAGAZINES is 26488755
Average number of installs for PERSONALIZATION is 5932384
Average number of installs for TOOLS is 13585731
Average number of installs for TRAVEL_AND_LOCAL is 26623593
Average number of installs for VIDEO_PLAYERS is 35554301
Average number of installs for BEAUTY is 513151
Average number of installs for FOOD_AND_DRINK is 2156683
Average number of ins
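
The install counts are stored as text such as `'1,000,000+'`, so they are open-ended buckets and not exact numbers. The conversion used in the cell above can be wrapped in a small helper, sketched here under the hypothetical name `parse_installs`.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to its lower bound."""
    return int(text.replace(',', '').replace('+', ''))


for raw in ['10,000+', '500,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Because each value is only the lower bound of its bucket, the averages computed from them are rough estimates that tend to understate the real install numbers.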