# Profitable App Profiles for the App Store and Google Play Markets #

The goal of this project is to evaluate apps from both the iOS App Store and Google Play markets to determine which kinds of apps are most likely to be profitable for future development efforts. We assume the analysis is carried out for a company that develops free iOS and Android mobile apps and earns its revenue primarily from in-app advertising.

* A selection of data for approximately 10,000 Android apps from Google Play was obtained from this [resource](https://www.kaggle.com/lava18/google-play-store-apps).

* A selection of data for approximately 7,000 iOS apps from the App Store was obtained from this [resource](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).


## Opening the Data Files and Initial Inspection ##

In [1]:
# Import Android apps csv data and create list of lists
from csv import reader

# The with statement closes each file once its rows have been read
with open("googleplaystore.csv", encoding="utf8") as open_android:
    android_appdata = list(reader(open_android))
android_header = android_appdata[0]
android_appdata = android_appdata[1:]

# Import iOS apps csv data and create list of lists
with open("AppleStore.csv", encoding="utf8") as open_apple:
    apple_appdata = list(reader(open_apple))
apple_header = apple_appdata[0]
apple_appdata = apple_appdata[1:]


The explore_data function prints the rows of a dataset between a start index and an end index (the end index is excluded), so we can choose how many rows to view. Setting rows_and_columns to True also prints the total number of rows and columns in the dataset.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Let's look at the Android data first:

In [3]:
explore_data(android_appdata,1,3,True)

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Now we'll look at the same information from the Apple App Store data:

In [4]:
explore_data(apple_appdata,1,3,True)

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


To get an idea of which columns may prove most useful for analysis, we print the header rows below:

In [5]:
print(android_header)
print("\n")
print(apple_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


For Android apps, the following columns appear to be relevant for our purposes: App, Category, Rating, Reviews, Installs, Type, Price, and Genres.

For Apple apps, we might be interested in: id, track_name, price, rating_count_tot, user_rating, and prime_genre.
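
Since both datasets are plain lists of lists, columns have to be addressed by position. As a minimal sketch, a lookup from column name to index can be built from the header row. The helper name `column_positions` and the `android_columns` dictionary are hypothetical and not part of the original notebook; the header and sample row are copied from the output above.

```python
# Header copied from the Google Play output shown above.
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

# Hypothetical helper (not in the original notebook): map each column
# name to its position so that rows can be indexed by name.
def column_positions(header):
    return {name: position for position, name in enumerate(header)}

android_columns = column_positions(android_header)

# One row copied from the exploration output above.
sample_row = ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M',
              '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play',
              'January 15, 2018', '2.0.0', '4.0.3 and up']

print(sample_row[android_columns['Installs']])
print(sample_row[android_columns['Genres']])
```

The same helper would work for the App Store header, for example to look up the position of prime_genre.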

## Removing Wrong Data ##

The [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section for the Android data indicates that there is an error with one of the rows of data.

Below, we look at the row in question, with the header row and another (correct) row for comparison.

In [6]:
print(android_appdata[10472])
print("\n")
print(android_header)
print("\n")
print(android_appdata[3])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


The row for "Life Made WI-Fi Touchscreen Photo Frame" shows a rating of 19, which is clearly an error because Google Play ratings only go up to 5. Further examination reveals that the Category value for this app is missing, which has shifted every remaining value one column to the left.
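
A missing value that shifts a row to the left also leaves that row one column short, so rows like this can be found by comparing each row's length with the header's length. The following is a minimal sketch of that check; `find_malformed_rows` is a hypothetical helper name, and the two sample rows are copied from the output above.

```python
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

# Two rows copied from the output above: one correct, one with the
# Category value missing.
sample_rows = [
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M',
     '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018',
     'Varies with device', '4.2 and up'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+',
     'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]

# Hypothetical helper: return the indexes of rows whose number of
# values differs from the number of columns in the header.
def find_malformed_rows(dataset, header):
    expected_length = len(header)
    return [index for index, row in enumerate(dataset)
            if len(row) != expected_length]

print(find_malformed_rows(sample_rows, android_header))
```

Run against the full dataset, a check like this would flag any row with a missing or extra value, not only the one reported in the discussion section.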

In the next step, we delete the incorrect row.

In [7]:
print(len(android_appdata))
print("\n")
del android_appdata[10472]   # Do not run this code more than one time!
print("\n")
print(len(android_appdata))

10841




10840
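
Because `del` removes whatever row currently sits at the given index, running the cell a second time would delete a valid row. One way to make the step safe to re-run is to delete only when the row at that index is actually malformed. This is a sketch under that assumption; `remove_row_if_malformed` is a hypothetical helper, and the sample data is made up for illustration.

```python
# Hypothetical helper: delete the row at `index` only if its length
# does not match the expected number of columns.
def remove_row_if_malformed(dataset, index, expected_length):
    if index < len(dataset) and len(dataset[index]) != expected_length:
        del dataset[index]
        return True
    return False

# Made-up sample: the middle row is missing one value.
sample_rows = [['App A', '4.1'], ['App B'], ['App C', '3.9']]

first_attempt = remove_row_if_malformed(sample_rows, 1, 2)
second_attempt = remove_row_if_malformed(sample_rows, 1, 2)

print(first_attempt, second_attempt, len(sample_rows))
```

On the second call the row now at index 1 has the expected length, so nothing is removed.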


## Removing Duplicate Entries ##

We will now examine the Google Play data to see if there are duplicate entries. If present, we will determine a method for eliminating the duplicates.

In [8]:
duplicate_apps = []
unique_apps = []

for app in android_appdata:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print("Number of duplicate apps: ", len(duplicate_apps))
print("\n")
print("Examples of duplicate apps: ", duplicate_apps[:7])
    

Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits']
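
One possible way to eliminate the duplicates, shown here only as a sketch, is to keep a single row per app name, choosing the row with the highest review count on the assumption that it is the most recent snapshot. That criterion is an assumption made for illustration, not a method stated above. The helper `keep_most_reviewed` is hypothetical, and the sample rows are made up; they only mimic the Google Play layout (name in column 0, review count in column 3).

```python
from collections import Counter

# Made-up sample rows, not real data.
sample_rows = [
    ['App A', 'TOOLS', '4.1', '100'],
    ['App B', 'TOOLS', '4.5', '250'],
    ['App A', 'TOOLS', '4.2', '180'],
    ['App A', 'TOOLS', '4.2', '150'],
]

# Count the extra entries: every occurrence of a name beyond the first.
name_counts = Counter(row[0] for row in sample_rows)
extra_entries = sum(count - 1 for count in name_counts.values())

# Hypothetical helper: keep, for each app name, the row with the
# highest review count (assumed criterion).
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    best_rows = {}
    for row in dataset:
        name = row[name_index]
        reviews = int(row[reviews_index])
        if name not in best_rows or reviews > int(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

deduplicated = keep_most_reviewed(sample_rows)
print(extra_entries)
print(deduplicated)
```

Using a dictionary keyed by app name also avoids the repeated list membership tests of the loop above, which can matter on larger datasets.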
