# Profitable App Profile for the App Store and Google Play Markets

This project aims to identify the characteristics of mobile applications that perform well on the App Store and Google Play. As data analysts at a company that builds Android and iOS mobile apps, our job is to help the development team make data-driven decisions about what kinds of apps to build.

The company only builds apps that are free to download and install, and its main source of revenue is in-app ads. This means the revenue for any given app is mostly influenced by the number of users who use it. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.

## Opening and Exploring the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play (source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)).

Collecting data on more than 4 million apps would require a significant amount of time and money, so we analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we first look for relevant existing data that is available at no cost. The two data sets below seem suitable for our goals:

* [Google Apps Store Datasets](https://www.kaggle.com/lava18/google-play-store-apps/home), containing data about approximately **10,000** Android apps from Google Play, collected in August 2018.
* [Apple iOS Apps Store Datasets](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home), containing data about approximately **7,000** iOS apps from the App Store, collected in July 2017.

Let's start by opening the two data sets and then explore the data.

In [14]:
## OPEN TWO DATA SETS#

from csv import reader

#------- Google App Store Data sets------#
# An explicit encoding avoids garbled characters in app names on some platforms
open_android = open('googleplaystore.csv', encoding='utf8')
android_file = reader(open_android)
android_lists = list(android_file)
open_android.close()
android_header = android_lists[0]
android_data = android_lists[1:]

#------- Apple iOS App Store Data sets------#
open_ios = open('AppleStore.csv', encoding='utf8')
ios_file = reader(open_ios)
ios_lists = list(ios_file)
open_ios.close()
ios_header = ios_lists[0]
ios_data = ios_lists[1:]

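Since both files are read the same way, the loading steps could also be wrapped in a small helper. The sketch below is one possible version, not part of the original notebook: the name `load_csv` and its `encoding` parameter are assumptions made for illustration. The `with` block closes the file as soon as it has been read.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # newline='' is what the csv module documentation recommends for reader()
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Possible usage, assuming the files sit next to the notebook:
# android_header, android_data = load_csv('googleplaystore.csv')
# ios_header, ios_data = load_csv('AppleStore.csv')
```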
In [15]:
# EXPLORE TWO DATA SETS #
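def explore_data(lists, header=True):
    # Print the first five rows (including the header row, if present)
    for row in lists[:5]:
        print(row, '\n')

    # Exclude the header row from the row count only when there is one
    if header:
        print('Number of rows: ', len(lists[1:]))
    else:
        print('Number of rows: ', len(lists))
    print('Number of columns: ', len(lists[0]))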

In [16]:
print('Android Data Explore: ')
explore_data(android_lists, header=True)

Android Data Explore: 
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up'] 

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'] 

['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'] 

Number of rows:  10841
Number of columns:  13


As the output shows, the Android data set has **13 features** (columns) and **10,841 rows**. The features that might be useful for our goals are:

| Column Name | Description                                               |
| ----------- | --------------------------------------------------------- |
| App         | Application name                                          |
| Category    | Category the app belongs to                               |
| Rating      | Overall user rating of the app                            |
| Reviews     | Number of user reviews                                    |
| Installs    | Number of times the app has been downloaded or installed  |
| Type        | Whether the app is paid or free                           |
| Genres      | Genres the app belongs to, apart from its main category   |
| Price       | Price of the app                                          |

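Because each row is a plain list, the columns above have to be accessed by position. One way to keep later code readable is to map column names to indices once. The snippet below is a minimal sketch using the header and first row printed above; the dictionary name `android_col` is an assumption made for illustration.

```python
# Header and first data row, copied from the output above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
first_row = ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN',
             '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone',
             'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']

# Map each column name to its position (hypothetical helper dictionary).
android_col = {name: index for index, name in enumerate(android_header)}

print(first_row[android_col['Installs']])  # 10,000+
print(first_row[android_col['Price']])     # 0
```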
In [17]:
print('iOS Data Explore: ')
explore_data(ios_lists, header=True)

iOS Data Explore: 
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'] 

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'] 

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'] 

['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1'] 

Number of rows:  7197
Number of columns:  16


The iOS data set has **16 features** (columns) and **7,197 rows**. The features that might be useful are:

| Column Name       | Description                                |
| ----------------- | ------------------------------------------ |
| id                | Application ID                             |
| track_name        | Application name                           |
| price             | Price of the app                           |
| rating_count_tot  | Number of user ratings (all versions)      |
| rating_count_ver  | Number of user ratings (current version)   |
| prime_genre       | Primary genre                              |
| currency          | Currency type                              |

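A quick way to get a feel for a column such as `currency` or `prime_genre` is to count its distinct values. The sketch below does this on the four iOS rows printed above, trimmed to three columns for brevity; the helper name `count_values` is an assumption, and the counts say nothing about the full data set.

```python
from collections import Counter

def count_values(rows, index):
    """Count how often each value appears in column `index` (hypothetical helper)."""
    return Counter(row[index] for row in rows)

# The four sample rows above, reduced to (track_name, currency, prime_genre).
sample = [
    ['Facebook', 'USD', 'Social Networking'],
    ['Instagram', 'USD', 'Photo & Video'],
    ['Clash of Clans', 'USD', 'Games'],
    ['Temple Run', 'USD', 'Games'],
]

print(count_values(sample, 1))                 # Counter({'USD': 4})
print(count_values(sample, 2).most_common(1))  # [('Games', 2)]
```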
## Data Cleaning

### Detecting and Deleting Wrong Data

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes an error in a certain row.



In [26]:
print('Wrong data: ',android_data[10472],'\n')
print('List of Features: ',android_header,'\n')
print('Example of right data: ',android_data[0])

Wrong data:  ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 

List of Features:  ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

Example of right data:  ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


The discussion is right: the row at index **10472** (not counting the header) is faulty. It has only 12 values while the header has 13, which is consistent with a missing **Category** value that shifts every later value one column to the left. As a result, the **Rating** column reads **19**, which is clearly off because the maximum rating for a Google Play app is **5**, and the **Content Rating** column is blank.

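The faulty row printed above has 12 values against a 13-column header, so comparing row lengths with the header length is a simple way to look for rows of this kind. The function below is a minimal sketch of that check; the name `find_malformed_rows` is an assumption, and whether index 10472 is the only row it would report on the full data set is not verified here.

```python
def find_malformed_rows(rows, header):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(rows) if len(row) != expected]

# Toy example: the second row is missing one value.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]
print(find_malformed_rows(toy_rows, toy_header))  # [1]

# Possible usage on the real data:
# find_malformed_rows(android_data, android_header)
```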
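So we need to delete this row from the Google Play data set. The cell below should be run only once, because running it again would delete whatever row then sits at index 10472.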

In [27]:
del android_data[10472]
print(len(android_data))

10840

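A plain `del` removes a row every time the cell is run, even after the faulty row is gone. One way to make the step safe to re-run is to delete the row only while it is still malformed. The sketch below assumes a hypothetical helper named `delete_if_malformed`; it is an alternative to the cell above, not the approach the notebook uses.

```python
def delete_if_malformed(rows, index, header):
    """Delete rows[index] only if its length differs from the header's.

    Returns True when a row was removed, False otherwise.
    """
    if index < len(rows) and len(rows[index]) != len(header):
        del rows[index]
        return True
    return False

# Toy example: the second call is a no-op because the bad row is already gone.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [['App A', 'TOOLS', '4.1'], ['App B', '1.9'], ['App C', 'GAME', '4.5']]
print(delete_if_malformed(toy_rows, 1, toy_header))  # True
print(delete_if_malformed(toy_rows, 1, toy_header))  # False
print(len(toy_rows))                                 # 2

# Possible usage on the real data:
# delete_if_malformed(android_data, 10472, android_header)
```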

### Removing Duplicate Entries