# Project 1: Profitable App Profiles for the App Store and Google Play Markets

### Project goals
I work for a company that builds Android and iOS mobile apps and makes them available on their respective stores. The apps are free to download and install, and our main source of revenue is in-app ads, so the number of users largely determines our revenue. The **goal** is to analyze which kinds of apps are likely to attract more users. To do this, we need to collect and analyze data about mobile apps on both the Google Play store and the Apple App Store.

### Dataset
The dataset and the corresponding documentation for the Android apps from the Google Play store can be downloaded from this [link](https://www.kaggle.com/datasets/lava18/google-play-store-apps).
Similarly, the dataset and the corresponding documentation for the iOS apps from the App Store can be downloaded from this [link](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps).

### Prerequisites
1. The dataset must contain only free apps.
2. The dataset must contain only English-language apps.

## 1. Opening and Exploring the datasets.

In [1]:
from csv import reader

# Importing the Google Play store dataset
# (the with-block closes the file once the rows have been read)
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android_dataset = list(reader(opened_file))

# Importing the iOS App Store dataset
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios_dataset = list(reader(opened_file))

In [2]:
# Exploring the datasets

# Function to explore a dataset
def explore_dataset(dataset, title, number_of_rows, header=True):
    print('Current Dataset')
    print(title)
    print('\n')
    # Skip the first row only when it is a header
    start = 1 if header else 0
    if header:
        print('Header')
        print(dataset[0])
        print('\n')
        # len(dataset[0]) counts the fields in the header, i.e. the columns
        print('Number of Columns')
        print(len(dataset[0]))
        print('\n')
    for row in dataset[start:start + number_of_rows]:
        print(row)
        print('\n')

# The function prints its report and returns None, so it is called directly
explore_dataset(android_dataset, 'Google Play Store Dataset', 5)
explore_dataset(ios_dataset, 'Apple App Store Dataset', 5)

Current Dataset
Google Play Store Dataset


Header
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of Columns
13


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book'

## 2. Data Cleaning.

### 2.1 Deleting Wrong Data.

In this step, we delete a record that was entered incorrectly. In `android_dataset`, the row at index 10473 (row 10472 when the header is not counted) has a problem: it has fewer fields than the header.

In [3]:
print(android_dataset[0])
print(android_dataset[10473])
print(len(android_dataset[10473]))  # Only 12 fields: the Category value is missing, so the remaining columns are shifted left

del android_dataset[10473]

print(android_dataset[0])
print(android_dataset[10473])
print(len(android_dataset[10473]))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
13
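
The faulty row above was located by its index. As a minimal sketch of how such rows could be found without knowing the index in advance, the hypothetical helper `find_malformed_rows` below compares every row's length with the header's length. The `sample` list is toy data standing in for `android_dataset`.

```python
def find_malformed_rows(dataset):
    """Return (index, row_length) for every row whose length differs from the header's."""
    expected = len(dataset[0])
    return [(index, len(row))
            for index, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]

# Toy data (not taken from the real file); the first row is the header
sample = [
    ['App', 'Category', 'Rating'],
    ['Sketch', 'ART_AND_DESIGN', '4.5'],
    ['Photo Frame', '1.9'],  # the Category field is missing
]

print(find_malformed_rows(sample))  # [(2, 2)]
```

The reported indices count the header as row 0, so they can be passed straight to `del dataset[index]`. When several rows are reported, delete from the highest index down so the remaining indices stay valid.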


### 2.2 Removing Duplicate entries

In this step, we remove the apps that have multiple entries. The criterion for choosing which entry to keep is the number of reviews, on the reasoning that the most recent record of an app should have the most reviews. So, for each app, we keep the entry with the most reviews and delete the others.

In [None]:
# Checking for the duplicates
def check_duplicates(dataset, appsname_index_number, dataset_name):
    duplicate_apps = []
    non_duplicate_apps = []
    for row in dataset[1:]:
        app_name = row[appsname_index_number]
        if app_name in non_duplicate_apps:
            duplicate_apps.append(app_name)
        else:
            non_duplicate_apps.append(app_name)
    print(dataset_name)
    print('The Duplicate Apps')
    print('The number of duplicate apps is', len(duplicate_apps))
    return duplicate_apps

android_duplicate_apps = check_duplicates(android_dataset, 0, 'Google dataset')
print('\n')
ios_duplicate_apps = check_duplicates(ios_dataset, 1, 'IOS dataset')

In [None]:
#Removing the duplicates

def remove_duplicates(dataset, appsname_index_number, reviews_index_number):
    reviews_max = {}
    for row in dataset[1:]:
        name = row[appsname_index_number]
        n_reviews = float(row[reviews_index_number])
        
        if name in reviews_max and reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
        elif name not in reviews_max:
            reviews_max[name] = n_reviews
        
    clean_data = []
    already_added = []

    for row in dataset[1:]:
        name = row[appsname_index_number]
        n_reviews = float(row[reviews_index_number])
    
        if n_reviews == reviews_max[name] and name not in already_added:
            clean_data.append(row)
            already_added.append(name)
            
    return clean_data

android_dataset_no_duplicates = remove_duplicates(android_dataset, 0, 3)
print(len(android_dataset_no_duplicates))

ios_dataset_no_duplicates = remove_duplicates(ios_dataset, 1, 5)
print(len(ios_dataset_no_duplicates))
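
The same "keep the entry with the most reviews" rule can also be sketched in a single pass with a dictionary keyed by app name. This is an alternative illustration, not the method used above: the helper name `keep_most_reviewed` is hypothetical and the rows are made-up toy data. It expects rows without the header.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep, for each app name, the first row that has the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows (name, reviews); the numbers are illustrative only
sample_rows = [
    ['Instagram', '66577313'],
    ['Instagram', '66577446'],
    ['Slack', '51507'],
]

deduplicated = keep_most_reviewed(sample_rows, 0, 1)
print(deduplicated)  # [['Instagram', '66577446'], ['Slack', '51507']]
```

A dictionary lookup takes roughly constant time, whereas checking `name not in already_added` on a list scans the whole list each time, so this variant scales better on large datasets.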

### 2.3 Removing Non-English Apps.

At our company, we only deal with English apps. So, we need to remove the apps which are not available in English. To do this, we use the `ord()` function, which returns the Unicode code point of a character. The characters commonly used in English text belong to the ASCII range, so their code points are not greater than 127. To allow for emoji and symbols such as ™, we remove an app only if its name has more than 3 characters outside that range.

In [None]:
#Check if the app is english or not
def english_or_not(string):
    total = 0
    for character in string:
        
        if ord(character) > 127:
            total += 1
            
        if total > 3:
            return False
    return True

print(english_or_not('Instagram'))
print(english_or_not('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_or_not('Docs To Go™ Free Office Suite'))
print(english_or_not('Instachat 😜'))
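
To make the threshold easier to see, the sketch below counts the characters above code point 127 in a few of the names tested above. The helper `count_non_ascii` is a hypothetical name introduced only for this illustration.

```python
def count_non_ascii(name):
    """Count the characters whose Unicode code point is above the ASCII range."""
    return sum(1 for character in name if ord(character) > 127)

for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, '->', count_non_ascii(name))
```

Each of these names has at most one character outside the ASCII range, so all of them stay under the limit of 3 and are kept.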

In [None]:
# english_or_not() is defined in the previous cell and reused here

# Remove non-English apps from Android dataset
android_dataset_english = []
for row in android_dataset_no_duplicates:
    name = row[0]
    if english_or_not(name):
        android_dataset_english.append(row)

# Remove non-English apps from iOS dataset
ios_dataset_english = []
for row in ios_dataset_no_duplicates:
    name = row[1]
    if english_or_not(name):
        ios_dataset_english.append(row)  
        
# Print the number of English apps remaining
print(len(android_dataset_english))
print(len(ios_dataset_english))


### 2.4 Isolating the free Apps.

In [None]:
android_final = []
ios_final = []

for app in android_dataset_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_dataset_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))
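
The comparisons above rely on the exact strings `'0'` and `'0.0'`. A slightly more tolerant check, sketched below, converts the price to a number first. The helper `is_free` is hypothetical, and the assumption that paid prices may carry a leading `$` is not confirmed by the output shown in this notebook.

```python
def is_free(price):
    """Return True when a price string represents zero, whatever its formatting."""
    cleaned = price.strip().lstrip('$')  # assumption: paid prices may look like '$4.99'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Non-numeric text (for example from a shifted row) is treated as not free
        return False

print(is_free('0'), is_free('0.0'), is_free('$4.99'))  # True True False
```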

## 3. Data Analysis

### 3.1 Most Common app by Genre

The end goal is to publish a newly developed app on both the Play store and the App Store, so we need to find app profiles that are successful in both markets. We start by determining the most common genre in each market using a frequency table. Dictionaries are not sorted by value, so we sort the tables before printing them for easier readability.

In [None]:
# Google play store
android_genre_count = {}

for row in android_final:
    genre = row[1]
    if genre in android_genre_count:
        android_genre_count[genre] += 1
    else:
        android_genre_count[genre] = 1
        
# IOS App store
ios_genre_count = {}

for row in ios_final:
    genre = row[-5]
    if genre in ios_genre_count:
        ios_genre_count[genre] += 1
    else:
        ios_genre_count[genre] = 1  
        
#in percentages: frequency table
def freq_table_percentages(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    
    for value in table:
        percentages = (table[value]/total)*100
        table_percentages[value] = percentages
    print(total)
    return table_percentages

def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
            
    return table

android_genre_percentages = freq_table_percentages(android_final, 1)
ios_genre_percentages = freq_table_percentages(ios_final, -5)


In [None]:
#printing the frequency table after sorting them
def display_table(dataset, index):
    table = freq_table_percentages(dataset, index)
    table_display = []
    
    for key in table:
        key_val_tuple = (table[key], key)
        table_display.append(key_val_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
        
# display_table prints its rows and returns None, so it is called directly
display_table(android_final, 1)
print('\n')
display_table(ios_final, -5)
print('\n')
display_table(android_final, -4)
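
The sorted frequency table described in this section can also be sketched with `collections.Counter` from the standard library, whose `most_common()` method returns entries ordered from most to least frequent. The helper name `sorted_percentages` and the rows are hypothetical toy values.

```python
from collections import Counter

def sorted_percentages(rows, index):
    """Return (value, percentage) pairs ordered from most to least common."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Toy rows (app name, genre)
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Education']]

for genre, percentage in sorted_percentages(sample_rows, 1):
    print(f'{genre}: {percentage:.2f}')
```

For the toy rows this prints `Games: 75.00` followed by `Education: 25.00`.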

### Discussion
The analysis shows that the App Store is dominated by Games. The Google Play store is more balanced, with Family, Game and Tools having similar shares. A closer look at the Google Play store shows that the Family category also contains games. On the whole, Games seems to be the most common genre among free English apps on both the App Store and the Play store. But this only tells us which apps are most numerous, not how popular they are in terms of the number of users.

### 3.2 Most popular app by Genre
This metric estimates how popular each genre is by the average number of users per app. For the App Store we use the total number of user ratings per app as a proxy, and for Google Play we use the number of installs.

### 3.2.1 In the iOS App Store

In [None]:
genres_ios = freq_table(ios_final, -5)


for genre in genres_ios:
    total = 0
    genre_count = 0
    
    for row in ios_final:
        genre_1 = row[-5]
        
        if genre == genre_1:
            n_ratings = float(row[5])
            total += n_ratings
            genre_count += 1
    avg_ratings = total / genre_count
    print(genre + ' : ' + str(avg_ratings))
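
The loop above rescans the whole dataset once per genre. As a sketch of an alternative, the averages can be accumulated in a single pass with two dictionaries. The helper `average_by_group` is a hypothetical name and the rows are made-up toy values.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: average of the numeric column} using one pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (genre, number of ratings)
sample_rows = [['Games', '100'], ['Games', '300'], ['Weather', '50']]

print(average_by_group(sample_rows, 0, 1))  # {'Games': 200.0, 'Weather': 50.0}
```

With the notebook's data, a call such as `average_by_group(ios_final, -5, 5)` would be expected to give the same averages as the nested loop.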
        


In [None]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5])

In [None]:
for app in ios_final:
    if app[-5] == 'Social Networking':
        print(app[1], ':', app[5])

In [None]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

### Discussion
Navigation apps seem to have the highest average number of user ratings, but this figure is heavily skewed by Waze and Google Maps. Social Networking apps, on the other hand, are somewhat more balanced, although most of the ratings belong to Facebook and Pinterest. Similarly, in the Reference genre, the Bible and dictionary apps pull the average up. This suggests an interesting idea: a famous or widely read book could be turned into an app.
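
One way to quantify the skew discussed above is to compare the mean with the median, since the median is much less sensitive to a few very large values. The sketch below uses `statistics` from the standard library; the helper `mean_and_median` is hypothetical and the numbers are invented for illustration, not taken from the dataset.

```python
from statistics import mean, median

def mean_and_median(rows, group_index, value_index, group):
    """Return (mean, median) of a numeric column for the rows in one group."""
    values = [float(row[value_index]) for row in rows if row[group_index] == group]
    return mean(values), median(values)

# Toy rows (genre, number of ratings): two large apps and three small ones
sample_rows = [
    ['Navigation', '300000'],
    ['Navigation', '150000'],
    ['Navigation', '500'],
    ['Navigation', '100'],
    ['Navigation', '0'],
]

print(mean_and_median(sample_rows, 0, 1, 'Navigation'))  # (90120.0, 500.0)
```

A mean far above the median, as in this toy example, indicates that a few apps account for most of the ratings.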

### 3.2.2 In the Google Play Store

In [None]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

In [None]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

In [None]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

In [None]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+'
                                            or app[5] == '500,000,000+'
                                            or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

### Discussion
Communication apps have the most installs on average, but this statistic is heavily skewed by a handful of hugely popular apps such as Facebook, WhatsApp and Skype. The average would drop dramatically if these apps were excluded. On Android, books and reference apps also show a high number of installs. However, because the Google Play store already contains a large number of library apps, we would need to differentiate our app from the others by adding a few extra features.
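
The effect of the largest apps on the average can be checked by recomputing it with an upper limit on installs. The sketch below is a minimal illustration: the helpers `parse_installs` and `average_installs`, the 100,000,000 cutoff, and the toy rows are all assumptions made for this example.

```python
def parse_installs(text):
    """Convert an installs string such as '1,000,000+' to an integer."""
    return int(text.replace(',', '').replace('+', ''))

def average_installs(rows, category, category_index, installs_index, below=None):
    """Average installs for one category, optionally ignoring apps at or above a cutoff."""
    values = [parse_installs(row[installs_index])
              for row in rows if row[category_index] == category]
    if below is not None:
        values = [value for value in values if value < below]
    return sum(values) / len(values) if values else 0.0

# Toy rows (name, category, installs); the values are illustrative only
sample_rows = [
    ['Giant messenger', 'COMMUNICATION', '1,000,000,000+'],
    ['Small chat', 'COMMUNICATION', '1,000,000+'],
    ['Tiny chat', 'COMMUNICATION', '500,000+'],
]

print(average_installs(sample_rows, 'COMMUNICATION', 1, 2))
print(average_installs(sample_rows, 'COMMUNICATION', 1, 2, below=100_000_000))
```

In the toy data the average falls from roughly 334 million to 750,000 once the single giant app is left out. With the notebook's data, the equivalent call would be `average_installs(android_final, 'COMMUNICATION', 1, 5, below=100_000_000)`.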

## Conclusion
On both the App Store and the Play store, apps for online reading or e-books are among the most downloaded. This might be a niche worth exploring, and developing an app in this genre appears practical. Other genres are possible but harder to monetize: social networking apps have high customer churn, and apps such as weather and finance apps might need more capital upfront to hire domain experts. In any case, it is important to give our app a differentiating feature that sets it apart from other apps on the market.