# Features of Profitable Apps for the App Store and Google Play Markets

In this project, I act as a data analyst working for a company that builds Android and iOS mobile apps. Our company only builds apps that are free to download and install, so our main source of revenue is in-app ads. This means that our revenue depends mostly on the number of users who use our apps and engage with the ads.

My goal in this project is to analyze data to help our developers understand what types of apps are likely to attract more users on the App Store and Google Play.

In [18]:
from csv import reader

## App Store data set ##
# The with statement closes the file once all rows have been read.
with open('AppleStore.csv', encoding='utf8') as open_file1:
    ios = list(reader(open_file1))
ios_header = ios[0]  # first row holds the column names
ios = ios[1:]        # remaining rows hold the app data

## Google Play data set ##
with open('googleplaystore.csv', encoding='utf8') as open_file2:
    android = list(reader(open_file2))
android_header = android[0]
android = android[1:]

## Opening and Exploring Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play.

Running an analysis on roughly 4 million apps requires a significant amount of time and money. For that reason, we will work on a sample instead. To avoid spending resources on collecting new data ourselves, we should first check whether any relevant existing data is available at no cost. Luckily, we have two data sets that seem suitable for that purpose:

* [Android Data Set](https://www.kaggle.com/lava18/google-play-store-apps) contains data about approximately ten thousand apps from Google Play.
* [iOS Data Set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) contains data about approximately ten thousand apps from the App Store.

Let's start by opening the two data sets and then continue with exploring them.

In [19]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # adds blank lines after each row
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

To make it easier to explore the two data sets, we use the `explore_data()` function defined above, which prints a slice of rows in a more readable way and can be called repeatedly. It also has an option to show the number of rows and columns of any data set.

In [20]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


We see that there are 7197 apps and 16 columns in the iOS data set. At a quick glance, the columns that might be useful for our analysis are `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'`.
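Every value in a row is stored as a string, so numeric columns such as `'price'` and `'rating_count_tot'` need to be converted before they can be compared or summed. The following is a minimal sketch using the Facebook row printed above; the variable names `facebook`, `price`, and `rating_count` are illustrative and not part of the original notebook.

```python
# Header and first row copied from the output above.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
facebook = ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676',
            '212', '3.5', '3.5', '95.0', '4+', 'Social Networking',
            '37', '1', '29', '1']

# Look up each column by name, then convert the string to a number.
price = float(facebook[ios_header.index('price')])
rating_count = int(facebook[ios_header.index('rating_count_tot')])

print('Free app:', price == 0.0)
print('Total ratings:', rating_count)
```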

Now let's look at the Google Play Data Set.

In [21]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


We have 10841 apps and 13 columns in this data set, and the columns that we are interested in are `'App'`, `'Category'`, `'Reviews'`, `'Installs'`, `'Type'`, `'Price'`, and `'Genres'`.
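Because each row is a plain list, the columns of interest have to be accessed by position. The following is a minimal sketch that maps each column name to its index, assuming the header printed above; `columns_of_interest` and `column_index` are illustrative names, not part of the original notebook.

```python
# Header copied from the Google Play output above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

# Columns named in the text as relevant to the analysis.
columns_of_interest = ['App', 'Category', 'Reviews', 'Installs',
                       'Type', 'Price', 'Genres']

# Map each column name to its position within a row.
column_index = {name: android_header.index(name) for name in columns_of_interest}

for name, index in column_index.items():
    print(name, '->', index)
```

With such a mapping, a value can be read as `row[column_index['Installs']]` instead of relying on a hard-coded number.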

## Deleting Wrong Data

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) outlines an error in row 10472. Let's print that row and compare it against the header and another row that is correct.

In [22]:
print(android_header)
print('\n')
print(android[0])
print('\n')
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Row 10472 corresponds to the app `Life Made WI-Fi Touchscreen Photo Frame`. Its `'Category'` value is missing, so the remaining values are shifted one column to the left and the rating appears as 19. This is clearly wrong because the maximum rating for a Google Play app is 5. Therefore, we will delete this row.
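A row like this can also be found without relying on the discussion thread: the faulty row printed above has 12 values, while the header has 13. The following is a minimal sketch of that length check; `find_malformed_rows` is a hypothetical helper, and `sample` holds only the two rows shown above rather than the full data set.

```python
def find_malformed_rows(dataset, header):
    """Return the indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(dataset) if len(row) != len(header)]

# Header and rows copied from the output above.
header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']
sample = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1',
     '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design',
     'January 7, 2018', '1.0.0', '4.0.3 and up'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]

malformed = find_malformed_rows(sample, header)
print(malformed)  # index 1 is the row with the missing category
```

A length check only catches rows with missing or extra values; a row with the right number of columns but a wrong value would need a separate check.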

In [23]:
print(len(android))
del android[10472] #don't run this more than once
print(len(android))

10841
10840
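Running the `del` statement a second time would silently remove a valid row. One way to make the cell safe to re-run is to delete only when the row at that index still matches the app we expect. The following is a minimal sketch under that assumption; `delete_row_if_matches` is a hypothetical helper, and the toy data stands in for the real data set.

```python
def delete_row_if_matches(dataset, index, expected_name):
    """Delete dataset[index] only if its first column equals expected_name.

    Returns True if a row was deleted, False otherwise.
    """
    if index < len(dataset) and dataset[index][0] == expected_name:
        del dataset[index]
        return True
    return False

# Toy data standing in for the real data set.
toy = [['App A', 'GAME'], ['Broken App', '1.9'], ['App B', 'TOOLS']]

first_run = delete_row_if_matches(toy, 1, 'Broken App')
second_run = delete_row_if_matches(toy, 1, 'Broken App')

print(first_run, second_run, len(toy))
```

On the real data, the equivalent call would be `delete_row_if_matches(android, 10472, 'Life Made WI-Fi Touchscreen Photo Frame')`.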
