# Profitable App Profiles for the App Store and Google Play Markets

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

The goal of this project is to analyze whether free apps can be profitable. The apps we analyze come from the App Store and Google Play.

At the end of the project, we want to help developers understand what types of apps are likely to attract more users.

# Data

## Acquisition

We collect and analyze data about mobile apps available on Google Play and the App Store, using two data sets.

The first is [a data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play, collected in August 2018. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

The second is [a data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store, collected in July 2017. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

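## Importing

As a first step, we import the reader function from the csv module, then read each file into a list of lists and assign it to a variable. The header row is stored separately from the data rows.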

In [1]:
from csv import reader

# Open AppleStore.csv as appstore
# Passing encoding='utf8' to open() prevents a UnicodeDecodeError
#
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
appstore = list(read_file)
appstore_header = appstore[0]
appstore = appstore[1:]

# Open googleplaystore.csv as gplay
#
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
gplay = list(read_file)
gplay_header = gplay[0]
gplay = gplay[1:]
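
The two loading blocks above follow the same pattern, so they could be folded into one helper. The sketch below is one possible way to do that; `load_dataset` is a hypothetical name that is not part of the original notebook, and it assumes the first row of each file is the header. Using a `with` block also closes the file once reading is done.

```python
from csv import reader


def load_dataset(path, encoding='utf8'):
    """Return (header, rows) for a CSV file whose first row is the header.

    Hypothetical helper, not part of the original notebook.
    """
    # newline='' is what the csv module documentation recommends for open()
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    # Split the header row from the data rows
    return rows[0], rows[1:]
```

Assuming the CSV files sit next to the notebook, it could be called as `appstore_header, appstore = load_dataset('AppleStore.csv')`.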

## Exploring

In [2]:
# We'll start by exploring these two data sets.
# To make exploring them easier,
# we create a function named explore_data()
#
def explore_data(dataset, start, end, rows_and_columns=False):
    '''
    Print a slice of a data set and, optionally, its dimensions.

    dataset: expected to be a list of lists.
    start, end: expected to be integers; the starting and ending indices
        of a slice from the data set.
    rows_and_columns: expected to be a Boolean; defaults to False.
        If True, the number of rows and columns is printed as well.
    '''

    dataset_slice = dataset[start:end] # slice the dataset
    for row in dataset_slice:
        print(row)
        print('\n') # adds blank lines between rows

    if rows_and_columns:
        print('Number of rows:', len(dataset)) # prints the row count
        print('Number of columns:', len(dataset[0])) # prints the column count

### Exploring App Store Dataset

Using explore_data() above, we'll print the column names and the first three rows of the App Store dataset.

In [3]:
print(appstore_header)
print('\n')
explore_data(appstore, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


### Exploring Google Play Dataset

Using explore_data() above, we'll print the column names and the first three rows of the Google Play dataset.

In [4]:
print(gplay_header)
print('\n')
explore_data(gplay, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


## Cleaning

### Column Length Checking

A [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on Kaggle reports that one row has a different number of columns than the header. Let's use a loop to find any row whose length does not match the header length.

In [5]:
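print("Apple Store mismatch result :")
# enumerate() gives the position directly; list.index() would rescan the list
# and return the first equal row, which can be wrong when rows are duplicated
for index, row in enumerate(appstore):
    if len(row) != len(appstore_header):
        print("Index :", index)
        print(row)
print("\n")
print("Google Play mismatch result :")
for index, row in enumerate(gplay):
    if len(row) != len(gplay_header):
        print("Index :", index)
        print(row)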

Apple Store mismatch result :


Google Play mismatch result :
Index : 10472
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


#### Deleting Wrong Data

We didn't find any mismatch in the App Store data, but in the Google Play dataset we found one row whose length differs from that of the header. We need to delete the row at index 10472 from the headerless data (gplay).
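
Before deleting the row, it can help to see how its values line up with the header. The sketch below pairs each column name with the value at the same position. The header and row are copied from the outputs above, while `bad_row`, `paired`, and `missing_columns` are names introduced here for illustration. The values appear shifted one column to the left starting at `Category` (for example, `Rating` ends up as `'19'`), which suggests a field is missing from the middle of the row rather than from its end.

```python
gplay_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                'Current Ver', 'Android Ver']

# The mismatched row, copied from the output above
bad_row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M',
           '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018',
           '1.0.19', '4.0 and up']

# zip() stops at the shorter sequence, so the last header column gets no value
paired = dict(zip(gplay_header, bad_row))
for column, value in paired.items():
    print(column, '->', repr(value))

# Header columns beyond the end of the short row
missing_columns = gplay_header[len(bad_row):]
print('Columns left without a value:', missing_columns)
```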

In [6]:
print("Old length :", len(gplay))

# Delete the mismatched row at index 10472
# (run this cell only once, or another row will be deleted)
#
del gplay[10472]
print("New length :", len(gplay))

# Re-check
#
print("\nGoogle Play mismatch result :")
for index, row in enumerate(gplay):
    if len(row) != len(gplay_header):
        print("Index :", index)
        print(row)

Old length : 10841
New length : 10840

Google Play mismatch result :


We deleted the mismatched row and re-checked the data: no mismatches remain.

### Duplicate Checking

According to the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, the Google Play dataset has duplicate entries.

In [7]:
duplicate_apps = []
unique_apps = []

for app in gplay:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:\n', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps:
 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
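
The loop above checks `name in unique_apps` against a list, which rescans the list for every app. As an alternative, and not the notebook's own method, the same count can be sketched with `collections.Counter`. The rows below are made-up toy data, and `count_duplicates` is a hypothetical helper name.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Count extra occurrences of each name beyond the first."""
    counts = Counter(row[name_index] for row in dataset)
    # An app listed k times contributes k - 1 duplicate entries
    duplicates = {name: n - 1 for name, n in counts.items() if n > 1}
    return sum(duplicates.values()), duplicates


# Made-up rows: name at index 0, review count at index 1
toy_rows = [['Box', '10'], ['Slack', '7'], ['Box', '12'], ['Box', '12']]
total, per_app = count_duplicates(toy_rows)
print('Number of duplicate entries:', total)
print('Duplicates per app:', per_app)
```

The total matches what the list-based loop counts, since that loop also records every occurrence after the first.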


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.

#### Removing Duplicate Entries

In [11]:
reviews_max = {}

for app in gplay:
    name = app[0]
    n_reviews = float(app[3])
    
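    # Update only when this row has more reviews than the stored maximum
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

    # First time we see this app: store its review count
    # (a plain else would overwrite the maximum with a lower count)
    elif name not in reviews_max:
        reviews_max[name] = n_reviews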

Inspect the dictionary to make sure everything went as expected. Measure the length of the dictionary; the expected length is 9,659 entries (10,840 rows minus 1,181 duplicates).

In [13]:
print('Expected length: ', len(gplay) - 1181)
print('Actual length: ', len(reviews_max))

Expected length:  9659
Actual length:  9659
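
With the highest review count per app available, one possible next step is to walk the data set again and keep only the row whose review count matches that maximum, skipping any later row for an app that has already been kept (duplicates can tie on review count). This is a sketch on made-up rows, not output from the real data; `toy_gplay`, `android_clean`, and `already_added` are illustrative names.

```python
# Made-up rows in the same layout as gplay: name at index 0, reviews at index 3
toy_gplay = [
    ['Box', 'BUSINESS', '4.2', '159', '19M'],
    ['Slack', 'BUSINESS', '4.4', '51507', '25M'],
    ['Box', 'BUSINESS', '4.2', '160', '19M'],
    ['Box', 'BUSINESS', '4.2', '160', '19M'],  # exact tie with the row above
]

# First pass: highest review count per app
reviews_max = {}
for app in toy_gplay:
    name = app[0]
    n_reviews = float(app[3])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

# Second pass: keep one row per app, the one with the highest review count
android_clean = []
already_added = set()
for app in toy_gplay:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.add(name)

print(android_clean)
```

The `already_added` check matters for ties: without it, both 'Box' rows with 160 reviews would be kept.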
