# Profitable App Profiles for the App Store and Google Play Markets
In this project, we look at apps available on the Google Play Store and the App Store.  

The goal of this project is to analyze which types of apps are more likely to attract users.

# Opening and Exploring the Data
Both data sets are available on Kaggle and can be downloaded through these links:

* [iOS](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
* [Android](https://www.kaggle.com/lava18/google-play-store-apps)

In [1]:
# import googleplaystore.csv
from csv import reader

# the with block closes the file once the rows have been read
with open('googleplaystore.csv', encoding='utf8') as open_file:
    android = list(reader(open_file))
android_header = android[0]
android = android[1:]

# import AppleStore.csv
with open('AppleStore.csv', encoding='utf8') as open_file:
    ios = list(reader(open_file))
ios_header = ios[0]
ios = ios[1:]

We use the `explore_data()` function to print a slice of rows from a data set. It can optionally also print the number of rows and columns of that data set.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
print(android_header)
print('\n')
explore_data(android,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The Google Play Store data set has 10841 rows and 13 columns. The columns that may be useful for analysis are: `App`, `Category`, `Rating`, `Size`, `Installs`, `Price` and `Genres`.
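
Because each row is a plain list, the columns named above are reached by position. A minimal sketch, assuming the header shown in the output above, looks up those positions with `list.index()`. The names `useful_columns` and `column_index` are introduced here for illustration only.

```python
# Header copied from the output above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

# Columns singled out as potentially useful for the analysis.
useful_columns = ['App', 'Category', 'Rating', 'Size', 'Installs', 'Price', 'Genres']

# Map each column name to its position in a row (illustrative helper dict).
column_index = {name: android_header.index(name) for name in useful_columns}
print(column_index)
```

With such a mapping, `row[column_index['Installs']]` reads the install count of a row without hard-coding the number 5.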

In [4]:
print(ios_header)
print('\n')
explore_data(ios,0,3,True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


The App Store data set has 7197 rows and 17 columns. The columns that may be useful for analysis are: `track_name`, `size_bytes`, `price`, `user_rating` and `prime_genre`.

# Deleting Incorrect Data

After finding out the number of rows, the number of columns, and the general structure of the data sets, we need to check whether they contain any incorrect data before starting the analysis.

Based on the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) at Kaggle, there is one row that has missing data (index 10472 or 10473, depending on whether the header is excluded).

The `for` loop below verifies this by comparing the length of each row with the length of the header. It prints every row whose length differs from the header's, along with the index of that row.

In [5]:
# to check for incorrect data in Android data set
# enumerate gives the position of the row directly, without a second search
for index, row in enumerate(android):
    if len(row) != len(android_header):
        print(row)
        print(index)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


In [6]:
print(android_header)
print('\n')
print(android[10471])
print('\n')
print(android[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Row 10472 corresponds to the _Life Made WI-Fi Touchscreen Photo Frame_ app. Its `Category` value is missing, so every later value is shifted one column to the left: _1.9_ lands in `Category` and _19_ lands in `Rating`, even though the maximum rating for a Google Play app is 5. Hence, we'll delete this row.
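
The shift is easier to see when each value is printed next to the column it ends up in. The sketch below pairs the header with the faulty row using `zip()`; both lists are copied from the outputs above, and the names `bad_row`, `aligned`, and `missing` are illustrative.

```python
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

bad_row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M',
           '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018',
           '1.0.19', '4.0 and up']

# zip() stops at the shorter sequence, so the last column gets no value.
aligned = list(zip(android_header, bad_row))
for column, value in aligned:
    print(f'{column:>15}: {value}')

# Number of values the row is short by.
missing = len(android_header) - len(bad_row)
print('Missing values:', missing)
```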

In [7]:
# run this cell only once: running it again would delete a valid row
del android[10472]

In [8]:
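# to check for incorrect data in iOS data set
bad_rows = 0
for index, row in enumerate(ios):
    if len(row) != len(ios_header):
        print(row)
        print(index)
        bad_rows += 1

# report success only if no row had the wrong length
if bad_rows == 0:
    print('No missing data')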

No missing data


The same check is repeated for the App Store data set, and no rows with missing data were found.
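
Since the same length check runs on both data sets, it could be wrapped in one helper. This is a minimal sketch, not part of the original notebook: `find_malformed_rows` is a hypothetical name, and the sample rows are made up to keep the example self-contained.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row) pairs whose length differs from the header's."""
    return [(index, row) for index, row in enumerate(dataset)
            if len(row) != len(header)]


# Tiny made-up sample, not taken from either data set.
sample_header = ['name', 'category', 'rating']
sample_rows = [
    ['App A', 'GAMES', '4.5'],
    ['App B', '3.9'],           # category missing, so the row is one value short
    ['App C', 'TOOLS', '4.1'],
]

malformed = find_malformed_rows(sample_rows, sample_header)
print(malformed)
```

Called as `find_malformed_rows(android, android_header)` or `find_malformed_rows(ios, ios_header)`, an empty list would mean that every row has the expected length.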

# Deleting Duplicate Data

## Part One

The next step in data cleaning is to find and remove duplicate entries, so that no app is counted more than once when we analyze the data.

For this purpose, we make use of two lists:
1. `unique_apps`: stores each app name the first time it is seen
2. `duplicate_apps`: stores the app name of every repeated row. Each name in this list also appears once in the `unique_apps` list.

In [23]:
# to check for duplicates in Google Play Store data set
unique_apps = []
duplicate_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


The Google Play Store data set has a total of 1181 duplicate rows. Examples of duplicated apps include Google My Business, Google Ads and Slack.
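
To see how often each name is repeated, rather than only which names repeat, `collections.Counter` is one option. The sketch below runs on a short illustrative list of names, so the counts it prints are not the real ones. The names `sample_names`, `repeated`, and `surplus` are assumptions made for this example.

```python
from collections import Counter

# Illustrative sample of app names; the counts here are not the real ones.
sample_names = ['Box', 'Slack', 'Box', 'Google Ads', 'Zenefits',
                'Google Ads', 'Google Ads']

name_counts = Counter(sample_names)

# Keep only the names that occur more than once.
repeated = {name: count for name, count in name_counts.items() if count > 1}
print(repeated)

# Surplus rows: how many rows would go if one row per name were kept.
surplus = sum(count - 1 for count in repeated.values())
print('Surplus rows:', surplus)
```

On the real data, `Counter(app[0] for app in android)` would take the place of `Counter(sample_names)`, and `surplus` would then equal `len(duplicate_apps)`.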

We could remove the duplicate entries at random. However, it is better to choose which row to keep using a criterion. In the Google Ads example below, the rows differ only in the fourth position, which is the `Reviews` column. The different numbers show that the data was collected at different times.

We'll use this as the criterion for keeping rows: for each app, we keep the row with the highest number of reviews, because the more reviews a row has, the more reliable its rating should be.

In [26]:
for app in android:
    name = app[0]
    if name == 'Google Ads':
        print(app)

['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29331', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']


In [None]:
#To remove duplicate entries
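
One way to apply the criterion described above is sketched here on a few sample rows rather than on the full data set. It first records the highest review count per app name in a dictionary, then keeps the first row that matches that maximum. The names `sample_android`, `reviews_max`, `android_clean`, and `already_added` are illustrative and do not come from the original notebook.

```python
# Sample rows trimmed to [App, Category, Rating, Reviews]. The 'Google Ads'
# values come from the output above; the 'Example App' row is made up.
sample_android = [
    ['Google Ads', 'BUSINESS', '4.3', '29313'],
    ['Google Ads', 'BUSINESS', '4.3', '29313'],
    ['Google Ads', 'BUSINESS', '4.3', '29331'],
    ['Example App', 'TOOLS', '4.0', '120'],
]

# Pass 1: highest number of reviews seen for each app name.
reviews_max = {}
for app in sample_android:
    name = app[0]
    n_reviews = float(app[3])
    if name not in reviews_max or n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews

# Pass 2: keep one row per app, the first whose review count equals the maximum.
android_clean = []
already_added = set()
for app in sample_android:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.add(name)

print(android_clean)
```

The `already_added` set matters when two rows of the same app share the maximum review count: without it, both rows would be kept.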

In [13]:
# to check for duplicates in App Store data set
# NOTE: app[0] is the unnamed row-index column of this data set;
# app[1] (`id`) or app[2] (`track_name`) would be a more meaningful key
unique_apps = []
duplicate_apps = []

for app in ios:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('Examples of duplicate apps:', duplicate_apps[:10])

Number of duplicate apps: 0
Examples of duplicate apps: []


Based on this check, the App Store data set doesn't contain any duplicate entries.