# Profitable App Profiles for the App Store and Google Play Markets

In this project, I am working as **Data Analyst** for a company that builds Android and iOS mobile apps. My job is to enable the company to make data-driven decisions with respect to the kind of apps they build.

At this company, they only build free apps to download and install, and the main source of revenue consists of in-app ads. This ads revenue is mostly influenced by the number of users that use the app. My goal for this project is to analyze data to help developers understand what kinds of apps are likely to attract more users.

## Opening and analyzing data

Collecting data of all the store apps requires a significant amount of time and money, so I'll try to analyze a sample of data instead. To avoid spending resources (time and/or money), I will use two data sets that seem suitable for our purpose:

* A [Google Play](https://www.kaggle.com/lava18/google-play-store-apps) and an [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) Kaggle dataset suitable for this project with 17k apps to analyze.

### Opening datasets

I will start by opening the two data sets and then continue with exploring the data.

In [1]:
from csv import reader

# returns the csv data as a list
def open_dataset(path):
    opened_file = open(path)
    readed_file = reader(opened_file)
    data_list = list(readed_file)
    return data_list

In [2]:
# android apps
android_list = open_dataset('googleplaystore.csv')
android_apps_header = android_list[0]
android_apps = android_list[1:]

# ios apps
ios_list = open_dataset('applestore.csv')
ios_apps_header = ios_list[0]
ios_apps = ios_list[1:]

### Exploring data

To make it easier to explore the two data sets, I'll write a function named _explore_dataset()_ that I can use repeatedly to explore rows in a more readable way

In [3]:
def explore_dataset(dataset, start, end=0):
    if end == 0:
        end = start + 1
        
    dataset_slice = dataset[start:end]

    for row in dataset_slice:
        print(row)

Also, I will create another function named _dataset_size()_ that shows me the rows and columns length

In [4]:
def dataset_size(dataset):
    dataset_rows = len(dataset)
    dataset_columns = len(dataset[0])

    # print values
    print('number of rows:', dataset_rows)
    print('number of columns:', dataset_columns)
    
    # return tuple
    data = (dataset_rows, dataset_columns)
    return data

In [5]:
print('Android Dataset')
print('================')
print('\n')

explore_dataset(android_apps, 0, 2)
print('\n')
dataset_size(android_apps)
print('\n')

Android Dataset


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


number of rows: 10841
number of columns: 13




In [6]:
print('IOS Dataset')
print('===========')
print('\n')

explore_dataset(ios_apps, 0, 2)
print('\n')
dataset_size(ios_apps)
print('\n')

IOS Dataset


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


number of rows: 7197
number of columns: 16




### Results

The data shows 10841 Android apps and 7197 iOS apps inside datasets

## Cleaning Data

### Removing rows with errors, missing or wrong data

It is important to read [Android Dataset Documentation](https://www.kaggle.com/lava18/google-play-store-apps) and [App Store Documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) to check information about the columns and an explanation of how data is collected and saved.  

The Google Play dataset has a dedicated discussion section, and we can see that [one](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) of the discussions outlines an error for row 10472. Print this row and compare it against the header and another row that is correct.

In [7]:
print('Header of the Android dataset')
print('=====================')
print(android_apps_header)
print('\n')

print('Wrong row!')
print('==========')
explore_dataset(android_apps, 10472)
print('\n')

print('Correct row')
print('============')
explore_dataset(android_apps, 1500)

Header of the Android dataset
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Wrong row!
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Correct row
['Zumper - Apartment Rental Finder', 'HOUSE_AND_HOME', '4.4', '11200', '25M', '1,000,000+', 'Free', '0', 'Everyone', 'House & Home', 'July 16, 2018', '4.5.15', '5.0 and up']


The row **10472** corresponds to the app *Life Made WI-Fi Touchscreen Photo Frame*, and I can see that the rating is __19__. This is clearly off because the maximum rating for a Google Play app is 5. As a consequence, I'll delete this row.

In [8]:
print('Before deleting:')
dataset_size(android_apps)

del android_apps[10472]  # don't run this more than once to del the row
print('\n')

print('After deleting:')
dataset_size(android_apps)

Before deleting:
number of rows: 10841
number of columns: 13


After deleting:
number of rows: 10840
number of columns: 13


(10840, 13)

### Isolating Free Apps

The goal of the company is to create a _FREE APP_ that its main source of revenue consists of in-app ads. The datasets contain both free and non-free apps so I will remove non-free apps.

In [9]:
print('Android Columns')
print('===============')
print(android_apps_header)
print('\n')

print('IOS Columns')
print('===============')
print(ios_apps_header)
print('\n')

Android Columns
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


IOS Columns
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']




On __Android__ the column 8 has the price. In the __IOS__ apps, the column that has the price is the column number 5. I will make a function that returns the dataset without non-free apps.

In [10]:
import re
def remove_non_free(dataset, price_column):
    only_free = []
    for row in dataset:
        price = re.findall(r'\d+', row[price_column])[0]
        if price == '0':
            only_free.append(row)
    return only_free

Now I have to create new datasets

In [11]:
free_android_apps = remove_non_free(android_apps, price_column=7)
free_ios_apps = remove_non_free(ios_apps, price_column=4)

Check both lists to see the difference

In [18]:
print('previous apps dataset')
print('=====================')
print('Android')
android_apps_rows, _ = dataset_size(android_apps)
print('IOS')
ios_apps_rows, _ = dataset_size(ios_apps)
print('\n')

print('free apps dataset')
print('=====================')
print('Android')
free_android_apps_rows, _ = dataset_size(free_android_apps)
print('IOS')
free_ios_apps_rows, _ = dataset_size(free_ios_apps)
print('\n')

previous apps dataset
Android
number of rows: 10840
number of columns: 13
IOS
number of rows: 7197
number of columns: 16


free apps dataset
Android
number of rows: 10188
number of columns: 13
IOS
number of rows: 4784
number of columns: 16




In [19]:
# get percentage that represents a part from a total
def percentage(part, total):
  return 100 * float(part)/float(total)

perc_android_apps = percentage(free_android_apps_rows, android_apps_rows)
perc_ios_apps = percentage(free_ios_apps_rows, ios_apps_rows)

print('In case of Android apps, {}% of them are free!'.format(perc_android_apps))
print('In case of IOS apps, {}% of them are free!'.format(perc_ios_apps))

In case of Android apps, 93.98523985239852% of them are free!
In case of IOS apps, 66.47214116993192% of them are free!


#### `Conclusion: I deduce by data analysis that Android platform has more percentage of free apps than IOS`

### Removing duplicated entries

I won't count each app more than once when I analyze data to help developers understand what kinds of apps are likely to attract more users.

#### Part 1: Check if there is duplicated entries

I will check first if there is any column that helps me quickly see if there are duplicated entries.

In [24]:
print('Android Columns')
print('===============')
print(android_apps_header)
print('\n')
print('IOS Columns')
print('===========')
print(ios_apps_header)

Android Columns
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


IOS Columns
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In case of __Android__ I can use the _App_ column

In [27]:
repeated_rows = 0
rows_dic = []

for app in free_android_apps:
    app_name = app[0] # this is App column
    if app_name in rows_dic:
        repeated_rows += 1
    else:
        rows_dic.append(app_name)

print('repeated rows in android apps: {}'.format(repeated_rows))

repeated rows: 1138


Now, I will check on IOS apps using the _id_ column

In [30]:
repeated_rows = 0
rows_dic = []

for app in free_ios_apps:
    app_name = app[0] # this is id column
    if app_name in rows_dic:
        repeated_rows += 1
    else:
        rows_dic.append(app_name)

print('repeated rows in ios apps: {}'.format(repeated_rows))

repeated rows in ios apps: 0


#### `Conclusion: I only have to remove rows from Android apps`

#### Part 2: Remove duplicated entries from Android Apps

To create a unique list of apps from Android list I will take into account the _name_ column and the _last updated_ column. The purpose of this is to create a new list of unique apps where I can find which of those are more attractive to the users. _Last updated_ column is important because I will take more into account those apps that have recently updated. 

In [38]:
from datetime import datetime

free_ios_unique_apps = free_ios_apps #IOS does not have duplicated entries

# remove duplicated entries from Android
# first, sort Android apps by date
free_android_apps.sort(key = lambda x: datetime.strptime(x[10], '%d %b %Y'))

# create a dictionary
unique_apps_dic = []
free_android_unique_apps = []

# add unique apps to a list
for index, app in enumerate(free_android_apps):
    name = app[0] # app name column
    last_updated = app[10] # last updated column
    if name not in unique_apps_dic:
        unique_apps_dic.append(name)
        free_android_unique_apps.append(app)


ValueError: time data 'January 7, 2018' does not match format '%d %b %Y'