# Analysis of successful apps in Google Play and the App Store

What does it take to make a free app successful? This analysis aims to highlight the key factors behind a successful free app by studying 10,000 apps from Google Play and about 7,000 apps from the App Store, using data collected in 2018.

Both datasets come from Kaggle. Information about the Google Play Store Apps dataset can be found [here](https://www.kaggle.com/lava18/google-play-store-apps),
while information about the Apple Store Apps dataset can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

### Technologies and helper functions

This study uses only pure Python and its standard library (the `csv` module); no third-party libraries are needed.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
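def open_dataset(file_name='AppleStore.csv', remove_header=True):
    from csv import reader
    # Pass the encoding explicitly (assumed to be UTF-8 here): the platform
    # default (e.g. cp1252 on Windows) can raise UnicodeDecodeError on this file.
    with open(file_name, encoding='utf8') as opened_file:
        data = list(reader(opened_file))
    if remove_header:
        return data[1:]

    return data

### Data exploration

In [3]:
apple_data_full = open_dataset(remove_header = False)

Note: when the file was opened without an explicit encoding, this cell failed with `UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2693: character maps to <undefined>`, which is why `open_dataset` passes `encoding='utf8'`.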

In [None]:
apple_header = apple_data_full[0]
apple_header

In [None]:
print(len(apple_header))

In [None]:
apple_data = apple_data_full[1:]

In [None]:
print(len(apple_data))

In [None]:
explore_data(apple_data, 0, 5)

In [None]:
google_play_full = open_dataset('googleplaystore.csv', remove_header = False)

In [None]:
google_play_header = google_play_full[0]
google_play_header

In [None]:
print(len(google_play_header))

In [None]:
google_play = google_play_full[1:]

In [None]:
print(len(google_play))

In [None]:
explore_data(google_play, 0, 5)

## Data cleaning

### Removing row with errors

The Google Play dataset has a problem at row 10472, as reported in [this discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on Kaggle. <br>
The 'Category' value is missing from that row, so all the remaining values in the row are shifted by one column.

In [None]:
problem_row = google_play[10472]
print(google_play[10472])

In [None]:
print(len(problem_row))

In [None]:
#removing the row 10472
del google_play[10472]

In [None]:
print(len(google_play))
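
Rather than relying on a known row index, a malformed row like this one can also be found by comparing each row's length with the header's length. The following is a minimal sketch on hypothetical toy data; `find_malformed_rows` is an illustrative helper, not part of the original notebook.

```python
# Hypothetical toy data: a header plus three rows, one of them missing a column.
header = ['App', 'Category', 'Rating']
rows = [
    ['Photo Editor', 'ART_AND_DESIGN', '4.1'],
    ['Broken App', '1.9'],  # 'Category' is missing, so the values are shifted
    ['Coloring Book', 'ART_AND_DESIGN', '3.9'],
]

def find_malformed_rows(rows, expected_length):
    """Return (index, row) pairs whose column count differs from expected_length."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected_length]

malformed = find_malformed_rows(rows, len(header))
print(malformed)

# Delete from the end so that earlier indexes stay valid while removing.
for index, _ in reversed(malformed):
    del rows[index]

print(len(rows))
```

With the real data, the same check could be run as `find_malformed_rows(google_play, len(google_play_header))`.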

### Removing duplicates

The Google Play dataset has some duplicate rows, as we can see here for the Instagram app.

In [None]:
for app in google_play:
    name = app[0]
    if name == 'Instagram':
        print(app)

There are four Instagram entries: they are the same app, but with data collected at different times, with an increasing number of user ratings. <br>
We are going to keep only the entry with the highest number of user ratings, which should be the most recent one, and remove all the older entries with fewer user ratings.
The same procedure will be applied to all the duplicate apps in the dataset.

First we are going to count all the duplicate apps.

In [None]:
duplicate_apps_gp = []
unique_apps_gp = []

for app in google_play:
    name = app[0]
    if name in unique_apps_gp:
        duplicate_apps_gp.append(name)
    else:
        unique_apps_gp.append(name)

In [None]:
print('Number of duplicate apps: ', len(duplicate_apps_gp))

There are 1181 duplicate apps in the Google Play data
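
The membership test `name in unique_apps_gp` scans a list, which becomes slow as the list grows. A minimal sketch of the same counting logic with a set, shown on hypothetical toy names; `count_duplicates` is an illustrative name, not part of the notebook.

```python
def count_duplicates(names):
    """Split names into repeated occurrences and the set of distinct names."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            # every occurrence after the first counts as a duplicate
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates, seen

# Hypothetical toy input, not taken from the dataset.
names = ['Instagram', 'Maps', 'Instagram', 'Instagram', 'Maps', 'Notes']
duplicates, unique = count_duplicates(names)
print(len(duplicates), len(unique))
```

The counts are the same as with the list-based version; only the lookup cost changes.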

In [None]:
# number of rows expected after removal of duplicates
len(google_play) - len(duplicate_apps_gp)

In [None]:
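duplicate_apps_apple = []
unique_apps_apple = []

for app in apple_data:
    # index 0 is not the app name here (the name is at index 1);
    # it appears to hold the app id, so duplicates are detected by id
    app_id = app[0]
    if app_id in unique_apps_apple:
        duplicate_apps_apple.append(app_id)
    else:
        unique_apps_apple.append(app_id)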

In [None]:
print('Number of duplicate apple apps: ', len(duplicate_apps_apple))

There are no duplicate apps for Apple.

We are going to create a dictionary that maps each Google Play app name to its highest number of user ratings, and then keep only the entry with that number of ratings for each app in the Google Play data.

In [None]:
#the number of reviews for Google Play data is at index 3
rating_index = 3
app_name_index = 0
reviews_max = {}
for app in google_play:
    name = app[app_name_index]
    user_ratings = float(app[rating_index])
    if name in reviews_max:
        if user_ratings > reviews_max[name] :
            reviews_max[name] = user_ratings
    else:
        reviews_max[name] = user_ratings
        
print(len(reviews_max))        
        

In [None]:
#removing duplicates
google_play_clean = []
already_added = []
for app in google_play:
    name = app[app_name_index]
    user_ratings = float(app[rating_index])
    max_user_rating = reviews_max[name]
    if max_user_rating == user_ratings and name not in already_added:
        google_play_clean.append(app)
        already_added.append(name)
        
print(len(google_play_clean)) 
print(len(already_added))
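
The two passes above can also be folded into a single pass that stores, for each app name, the row with the most reviews seen so far. This is only a sketch under the same column assumptions (name at index 0, reviews at index 3), run on hypothetical toy rows; `keep_most_reviewed` is an illustrative name.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # strict '>' keeps the first row when two rows tie on review count
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Hypothetical toy rows: name, category, rating, number of reviews.
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Maps', 'TRAVEL', '4.3', '50'],
    ['Instagram', 'SOCIAL', '4.5', '120'],
    ['Instagram', 'SOCIAL', '4.5', '90'],
]
toy_clean = keep_most_reviewed(toy_rows)
print(toy_clean)
```

Because a dictionary lookup replaces the `already_added` list scan, this variant should also scale better on larger datasets.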
   
    

### Removing non-English app names

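We want to remove apps that are not in English, so we'll test the app names for the presence of non-English characters, that is, characters outside the ASCII range (code point above 127). Names with up to three such characters are still accepted, so that English names containing a symbol such as ™ or an emoji are not discarded.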

In [None]:
def is_app_name_English(app_name):
    non_eng_counter = 0
    for letter in app_name:
        if ord(letter) > 127:
            non_eng_counter += 1
    return non_eng_counter <= 3

In [None]:
#testing function
instagram_name  = 'Instagram'
chinese_name = '爱奇艺PPS -《欢乐颂2》电视剧热播'
tm_name = 'Docs To Go™ Free Office Suite'
emoji_name = 'Instachat 😜'
testing_names= [instagram_name, chinese_name, tm_name, emoji_name]

In [None]:
for name in testing_names:
    print(is_app_name_English(name))
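
To see why a threshold is used instead of rejecting any non-ASCII character, it can help to count those characters in each test name. A minimal sketch; `count_non_ascii` is an illustrative helper, not part of the notebook.

```python
def count_non_ascii(text):
    """Count the characters whose code point falls outside the ASCII range."""
    return sum(1 for character in text if ord(character) > 127)

test_names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
counts = [count_non_ascii(name) for name in test_names]
print(counts)
```

The trademark and emoji names each contain a single non-ASCII character, so they stay under the limit of three, while the Chinese title exceeds it.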

In [None]:
google_play_eng = []
apple_eng = []
#the index of the app name is different from google play apps
apple_name_index = 1
for app in google_play_clean:
    if is_app_name_English(app[app_name_index]):
        google_play_eng.append(app)

for app in apple_data:
    if is_app_name_English(app[apple_name_index]):
        apple_eng.append(app)

In [None]:
print(len(google_play_eng))

In [None]:
print(len(apple_eng))

### Isolating free apps

In this study we consider only free apps, whose revenue comes from ads, so we are going to remove the non-free apps from both datasets.

In [None]:
gp_price_index = 7
apple_price_index = 4

In [None]:
gp_free = []
apple_free = []

for app in google_play_eng:
    price = (app[gp_price_index]).strip('$')
    price = float(price)
    if price == 0.0:
        gp_free.append(app)
        
for app in apple_eng:
    price = (app[apple_price_index]).strip('$')
    price = float(price)
    if price == 0.0:
        apple_free.append(app)
    

In [None]:
print(len(gp_free))
print(len(apple_free))

### Strategy for finding the features of successful apps

One strategy for building successful apps consists of quickly building a minimal prototype for Android, testing it on Google Play and, if the app gets enough response from users, building an iOS app for the App Store. <br>
In order to minimize costs and risks, we need to find the minimal set of characteristics that define the profile of a successful app, and this is what we'll do in the next sections of this analysis.

### Most common genres

In [None]:
gp_genre_index = -4
apple_genre_index = 11

In [None]:
def freq_table(data_set, index):
    frequency_table = {}

    # the header row was already removed, so every row must be counted
    for row in data_set:
        value = row[index]
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1
    total_apps = len(data_set)
    percentage_table = {}
    for key in frequency_table:
        freq = frequency_table[key]
        percentage_table[key] = freq / total_apps * 100

    return percentage_table
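
A quick sanity check for a percentage table is that its values add up to 100 when every row is counted. The sketch below illustrates this on hypothetical toy rows; `percentage_table_sketch` is an illustrative stand-in, not the notebook's function.

```python
def percentage_table_sketch(rows, index):
    """Map each distinct value in a column to its share of rows, in percent."""
    counts = {}
    for row in rows:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(rows)
    return {value: count / total * 100 for value, count in counts.items()}

# Hypothetical toy rows: app name and genre.
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
toy_table = percentage_table_sketch(toy_rows, 1)
print(toy_table)                # {'Games': 75.0, 'Tools': 25.0}
print(sum(toy_table.values()))  # 100.0
```

If the total came out below 100, some rows would have been skipped while counting.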

In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [None]:
display_table(gp_free,gp_genre_index)

In [None]:
display_table(apple_free, apple_genre_index)

The most common genre for free apps in the App Store is clearly Games, with over 58% of apps, followed by Entertainment with nearly 8%. <br>
The Google Play store instead shows a greater fragmentation of genres: the most common genre, Tools, accounts for only 8.4% of apps, followed by Entertainment at 6%.

### Measuring popularity: number of installs

Until now we have focused on the number of apps in both Google Play and the App Store, considering the frequency of genres. However, the genre of an app is chosen by its developers; we are more interested in the choices made by users, because the real success of an app is always determined by users. <br>
One way to measure users' interest in an app is the number of installs. <br>
Neither dataset contains precise install counts, but we have the number of ratings for the App Store and a category of installs for Google Play. <br>
We'll make an assumption: the number of ratings is proportional to the number of installs (we don't have data to prove it, but it seems reasonable). <br>
We will measure the average number of user ratings per genre.

#### Number of Installs for the App Store

In [None]:
apple_genres = freq_table(apple_free, apple_genre_index)

In [None]:
total_genres = len(apple_genres)
total_genres

In [None]:
apple_tot_ratings_index = 5
apple_avg_ratings_for_genres ={}
for genre in apple_genres:
    total = 0
    total_user_ratings = 0
    for app in apple_free:
        genre_app = app[apple_genre_index]
        if genre_app == genre:
            total += 1
            total_user_ratings += float(app[apple_tot_ratings_index])
    apple_avg_ratings_for_genres[genre]  = total_user_ratings/total
 

In [None]:
def display_dict(dictionary):
    for key in dictionary:
        print("{} : {}".format(key, dictionary[key]))

In [None]:
display_dict(apple_avg_ratings_for_genres)
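
`display_dict` prints the genres in dictionary order, so finding the top genre means scanning the output by eye. Below is a minimal sketch, on hypothetical toy rows, of computing per-key averages in one pass and ranking them from highest to lowest; `average_by_key` and `rank_by_value` are illustrative names, not part of the notebook.

```python
def average_by_key(rows, key_index, value_index):
    """Average the numeric column value_index for each distinct key."""
    totals = {}
    counts = {}
    for row in rows:
        key = row[key_index]
        totals[key] = totals.get(key, 0.0) + float(row[value_index])
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}

def rank_by_value(table):
    """Return (key, value) pairs sorted from the highest value to the lowest."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy rows: genre and number of ratings (not real figures).
toy_rows = [
    ['Games', '100'],
    ['Games', '300'],
    ['Navigation', '1000'],
    ['Music', '400'],
    ['Music', '600'],
]
toy_ranking = rank_by_value(average_by_key(toy_rows, 0, 1))
for genre, average in toy_ranking:
    print(genre, ':', average)
```

With the real data, the call would presumably be `average_by_key(apple_free, apple_genre_index, apple_tot_ratings_index)`.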

Considering the average number of user ratings per genre (our proxy for installs), Games no longer dominate the App Store. <br>
Navigation is the most popular genre, with over 86,000 ratings per app on average, followed by Music apps and Weather apps.

#### Number of Installs for Google Play

In [None]:
google_play_categories =  freq_table(gp_free,1)

In [None]:
google_play_categories

In [None]:
display_table(gp_free, 5)

In [None]:
gp_avg_installs ={}
for category in google_play_categories:
    total = 0
    len_category = 0
    for app in gp_free:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    gp_avg_installs[category] = avg_n_installs
   
    

In [None]:
display_dict(gp_avg_installs)
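
The cleaning step inside the loop turns an install bracket into a number by dropping the thousands separators and the trailing plus sign. A minimal sketch of that step in isolation; `parse_installs` is an illustrative helper, and the sample strings are assumed to match the format of the column.

```python
def parse_installs(text):
    """Convert an install bracket such as '1,000,000+' into an integer."""
    return int(text.replace(',', '').replace('+', ''))

# Sample strings assumed to follow the format of the installs column.
samples = ['1,000,000+', '500+', '0']
parsed = [parse_installs(sample) for sample in samples]
print(parsed)  # [1000000, 500, 0]
```

Each bracket is an open-ended range, so the parsed number is only a lower bound: an app listed at '1,000,000+' is counted as exactly one million installs, and the category averages are correspondingly rough.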

Communication seems to be the most popular category for Android apps, based on the average number of installs.