# Profitable App Profiles for the App Store and Google Play Markets

In this project, I am working as a data analyst for a company that builds Android and iOS mobile apps. My job is to enable the company to make data-driven decisions about the kinds of apps it builds.

The company only builds apps that are free to download and install, and its main source of revenue is in-app ads. Ad revenue depends mostly on the number of people who use an app. My goal for this project is to analyze data to help developers understand what kinds of apps are likely to attract more users.

## Opening and analyzing data

Collecting data on every app in both stores would take a significant amount of time and money, so I'll analyze a sample instead. I will use two existing data sets that seem suitable for this purpose:

* A [Google Play](https://www.kaggle.com/lava18/google-play-store-apps) and an [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) Kaggle dataset, which together contain about 18,000 apps to analyze.

### Opening datasets

Let's start by opening the two data sets and then continue with exploring the data.

In [1]:
from csv import reader

# returns the csv data as a list of rows (each row is a list of strings)
def open_dataset(path):
    # the with block closes the file once it has been read
    with open(path, encoding='utf8') as opened_file:
        data_list = list(reader(opened_file))
    return data_list

In [6]:
# android apps
android_list = open_dataset('googleplaystore.csv')
android_apps_header = android_list[0]
android_apps = android_list[1:]

# ios apps
ios_list = open_dataset('applestore.csv')
ios_apps_header = ios_list[0]
ios_apps = ios_list[1:]
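
The header/data split used above can be tried without the Kaggle files. The following is a minimal sketch that reads a tiny in-memory CSV; the sample rows and the `sample_*` names are made up for illustration and are not part of either dataset.

```python
import io
from csv import reader

# Hypothetical sample standing in for a real CSV file
sample_csv = io.StringIO(
    "App,Category,Rating\n"
    "Example App,TOOLS,4.2\n"
    "Another App,GAME,3.9\n"
)

sample_list = list(reader(sample_csv))
sample_header = sample_list[0]   # first row holds the column names
sample_apps = sample_list[1:]    # remaining rows are the data

print(sample_header)
print(len(sample_apps))
```

Note that `reader` returns every value as a string, so numeric columns such as the rating need converting before any arithmetic.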

### Exploring data

To make the two data sets easier to explore, I'll write a function named __explore_dataset()__ that I can call repeatedly to print rows in a more readable way.

In [24]:
def explore_dataset(dataset, start, end=0):
    if end == 0:
        end = start + 1
        
    dataset_slice = dataset[start:end]

    for row in dataset_slice:
        print(row)
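
To illustrate how `start` and `end` map to rows, here is a small sketch that repeats the function on a made-up list (`toy_rows` is hypothetical). The `end` index is exclusive, and leaving it out prints only the row at `start`.

```python
def explore_dataset(dataset, start, end=0):
    # With no end given, show just the single row at index start
    if end == 0:
        end = start + 1
    for row in dataset[start:end]:
        print(row)

# Hypothetical toy data, not taken from the real datasets
toy_rows = [['a', '1'], ['b', '2'], ['c', '3']]

explore_dataset(toy_rows, 0, 2)  # prints the rows at index 0 and 1
explore_dataset(toy_rows, 2)     # prints only the row at index 2
```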

I will also create a function named __dataset_size()__ that prints the number of rows and columns.

In [25]:
def dataset_size(dataset):
    dataset_rows = len(dataset)
    dataset_columns = len(dataset[0])
    print('number of rows:', dataset_rows)
    print('number of columns:', dataset_columns)
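
Because __dataset_size()__ only prints, it returns `None`. If the counts are needed later in the analysis, one possible alternative is a function that returns them. The sketch below assumes a hypothetical helper named `dataset_shape()` and made-up toy data.

```python
def dataset_shape(dataset):
    """Return (rows, columns); the column count comes from the first row."""
    if not dataset:
        return (0, 0)  # an empty dataset has no first row to measure
    return (len(dataset), len(dataset[0]))

# Hypothetical toy data, not taken from the real datasets
toy_rows = [['a', '1'], ['b', '2'], ['c', '3']]

rows, columns = dataset_shape(toy_rows)
print('number of rows:', rows)
print('number of columns:', columns)
```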

In [35]:
print('Android Dataset')
print('================')
print('\n')

explore_dataset(android_apps, 0, 2)
print('\n')
dataset_size(android_apps)
print('\n')

Android Dataset


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


number of rows: 10841
number of columns: 13


In [36]:
print('IOS Dataset')
print('===========')
print('\n')

explore_dataset(ios_apps, 0, 2)
print('\n')
dataset_size(ios_apps)
print('\n')

IOS Dataset


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


number of rows: 7197
number of columns: 16


### Results

The data sets contain 10,841 Android apps and 7,197 iOS apps.

## Cleaning Data

### Removing rows with errors or missing data

Before cleaning, it is worth reading the [Android Dataset Documentation](https://www.kaggle.com/lava18/google-play-store-apps) and the [App Store Documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) to learn what each column means and how the data was collected and saved.

The Google Play dataset has a dedicated discussion section, and [one](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) of the discussions outlines an error in row 10472. I'll print this row and compare it against the header and against another row that is correct.

In [43]:
print('Wrong row!')
print('==========')
explore_dataset(android_apps, 10472)

print('Header of the Android dataset')
print('=====================')
print(android_apps_header)
print('\n')

print('Correct row')
print('============')
explore_dataset(android_apps, 1500)

Wrong row!
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Header of the Android dataset
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Correct row
['Zumper - Apartment Rental Finder', 'HOUSE_AND_HOME', '4.4', '11200', '25M', '1,000,000+', 'Free', '0', 'Everyone', 'House & Home', 'July 16, 2018', '4.5.15', '5.0 and up']




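The row **10472** corresponds to the app *Life Made WI-Fi Touchscreen Photo Frame*. It has only 12 values instead of 13: the *Category* value is missing, so every later value is shifted one column to the left and __19__ ends up in the *Rating* column. This is clearly off because the maximum rating for a Google Play app is 5. As a consequence, I'll delete this row.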

In [45]:
dataset_size(android_apps)
del android_apps[10472]  # run this only once; running it again would delete a different row
dataset_size(android_apps)

number of rows: 10841
number of columns: 13
number of rows: 10840
number of columns: 13
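
Instead of relying on a row index reported by someone else, rows with a missing or extra value can be found by comparing each row's length with the header's. This is a minimal sketch on made-up data; `find_malformed_rows()` and the `toy_*` names are hypothetical. It only catches rows with the wrong number of fields, not rows whose values are wrong but complete.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [index for index, row in enumerate(dataset) if len(row) != expected]

# Hypothetical toy data, not taken from the real datasets
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Example App', 'TOOLS', '4.2'],
    ['Broken App', '4.0'],            # Category is missing
    ['Another App', 'GAME', '3.9'],
]

bad_indexes = find_malformed_rows(toy_rows, toy_header)
print(bad_indexes)

# Delete from the end so the earlier indexes stay valid
for index in reversed(bad_indexes):
    del toy_rows[index]
```

Deleting in reverse order matters when more than one row is flagged, because removing a row shifts the index of every row after it.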


### Removing duplicated rows

... to be continued