# Profitable App Profiles for the App Store and Google Play Markets

The goal of this project is to analyse the App Store and Google Play datasets to help developers understand which types of apps are likely to attract more users.

# 1. Opening and Exploring the Data

We are going to explore the 'Google Play Store Apps' and the 'Mobile App Store (7200 apps)' datasets.

In [16]:
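# First we open the two datasets and read each one into a list of lists.
from csv import reader

# encoding='utf8' avoids decoding errors where the platform default is not UTF-8;
# the with blocks close each file once it has been read.
with open('AppleStore.csv', encoding='utf8') as open_apple:
    apps_data_apple = list(reader(open_apple))

with open('googleplaystore.csv', encoding='utf8') as open_google:
    apps_data_google = list(reader(open_google))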

In [17]:
# Function to print rows of a given dataset in a readable way (provided by Dataquest)
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [18]:
# Print the header row and the first app of the Apple dataset, then the header row of the Google dataset
explore_data(apps_data_apple, 0, 2)
explore_data(apps_data_google, 0, 1)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']




## Data Cleaning

In this section we remove incorrect and duplicate entries, and modify the data to fit the purpose of the analysis.

In [19]:
# Row 10473 has an error: its 'Category' value is missing, so it has 12 values instead of 13
explore_data(apps_data_google,0,1)
explore_data(apps_data_google,10473 ,10475)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
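
The faulty row above has 12 values against 13 column names, so rows of this kind can be found by comparing each row's length with the header's. The following is a minimal sketch, assuming the dataset is a list of lists with the header first; `find_malformed_rows` and the toy data are hypothetical and not part of the original analysis.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    header_length = len(dataset[0])
    return [
        (index, row)
        for index, row in enumerate(dataset)
        if len(row) != header_length
    ]


# Toy data standing in for apps_data_google (values are illustrative only).
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.2'],
    ['Broken App', '1.9'],  # the category is missing, so the row is too short
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

This check only catches rows with a length mismatch. A row that has the right number of columns but holds placeholder values such as 'NaN' would need a separate test.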




In [20]:
# We delete it and print the same slice again to confirm that the next row has moved up
del apps_data_google[10473]
explore_data(apps_data_google,10473 ,10475)

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


['Sat-Fi Voice', 'COMMUNICATION', '3.4', '37', '14M', '1,000+', 'Free', '0', 'Everyone', 'Communication', 'November 21, 2014', '2.2.1.5', '2.2 and up']




In [21]:
# We do the same with row 9149, which has 'NaN' for its rating and type and 0 installs
explore_data(apps_data_google,9148 ,9150) 

['Plants vs. Zombies™ 2', 'FAMILY', '4.4', '567632', '15M', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Casual', 'June 12, 2018', '6.8.1', '4.1 and up']


['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']




In [22]:
del apps_data_google[9149]
explore_data(apps_data_google,9148 ,9150) 

['Plants vs. Zombies™ 2', 'FAMILY', '4.4', '567632', '15M', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Casual', 'June 12, 2018', '6.8.1', '4.1 and up']


['Star Wars™: Galaxy of Heroes', 'FAMILY', '4.5', '1461698', '67M', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Role Playing', 'May 21, 2018', '0.12.334385', '4.1 and up']




## Deleting the Duplicate Entries
The Google Play dataset contains duplicate entries, as we can see in the cell below. Let's find out the number of duplicate app entries in the Google Play dataset.

In [23]:
duplicate_apps = []
unique_apps = []

for app in apps_data_google:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps: ', len(duplicate_apps), '\n')
print('Examples of duplicate apps: ', duplicate_apps[:10])

Number of duplicate apps:  1181 

Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
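
Checking `name in unique_apps` against a list scans the whole list on every iteration. As an alternative, the same count can be obtained in a single pass with `collections.Counter`. This is a minimal sketch under the assumption that the app name is in column 0; `count_duplicates` and the toy data are hypothetical.

```python
from collections import Counter


def count_duplicates(dataset, name_index=0):
    """Count duplicate entries and report which names occur more than once."""
    name_counts = Counter(row[name_index] for row in dataset)
    # A name seen k times contributes k - 1 duplicate entries.
    n_duplicates = sum(count - 1 for count in name_counts.values())
    repeated = {name: count for name, count in name_counts.items() if count > 1}
    return n_duplicates, repeated


# Toy data standing in for apps_data_google (header row included).
sample = [['App'], ['Box'], ['Slack'], ['Box'], ['Box'], ['Slack']]

n_duplicates, repeated = count_duplicates(sample)
print('Number of duplicate apps:', n_duplicates)  # 3
print('Repeated names:', repeated)  # {'Box': 3, 'Slack': 2}
```

The count matches the loop-based approach because each repeated occurrence of a name is counted once.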



We are going to remove the duplicates so that there is only one entry per app, but we won't remove them at random. For each app we will keep the entry with the highest number of reviews, since a higher review count suggests that the entry is the most recent one.

To do this we will create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app. The information stored in the dictionary will be used to create a new dataset, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [24]:
reviews_max = {}

for row in apps_data_google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

9658


In [25]:
android_clean = []
already_added = []

for row in apps_data_google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
print(len(android_clean))

9658


## For the Apple Store Dataset

In [27]:
print(len(apps_data_apple))

reviews_max_apple = {}

for row in apps_data_apple[1:]:
    name = row[1]
    n_reviews = float(row[5])
    if name in reviews_max_apple and reviews_max_apple[name] < n_reviews:
        reviews_max_apple[name] = n_reviews
    if name not in reviews_max_apple:
        reviews_max_apple[name] = n_reviews
        
print(len(reviews_max_apple))

apple_clean = []
already_added_apple = []

for row in apps_data_apple[1:]:
    name = row[1]
    n_reviews = float(row[5])
    if (n_reviews == reviews_max_apple[name]) and (name not in already_added_apple):
        apple_clean.append(row)
        already_added_apple.append(name)
print(len(apple_clean))

7198
7195
7195
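
The two datasets are cleaned with the same two-pass logic and differ only in the column positions of the app name and the review count. The sketch below wraps that logic in one function, assuming the header is the first row; `remove_duplicates` is a hypothetical helper, not part of the original analysis, and the toy data is illustrative only.

```python
def remove_duplicates(dataset, name_index, reviews_index):
    """Keep one row per app name: the first row with the highest review count."""
    # First pass: record the highest review count seen for each name.
    reviews_max = {}
    for row in dataset[1:]:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # Second pass: keep the first row that matches the recorded maximum.
    clean = []
    already_added = set()  # a set gives fast membership checks
    for row in dataset[1:]:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean


# Toy data: 'Box' appears three times, twice with the same maximum.
sample = [
    ['App', 'Reviews'],
    ['Box', '100'],
    ['Box', '250'],
    ['Slack', '40'],
    ['Box', '250'],
]

print(remove_duplicates(sample, 0, 1))
```

With the column positions used above, the calls would be `remove_duplicates(apps_data_google, 0, 3)` and `remove_duplicates(apps_data_apple, 1, 5)`.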
