# Profitable App Profiles on the App Store and Google Play Markets
In this project, we try to understand which types of apps are most attractive to mobile users.


In [21]:
from csv import reader

### The Google Play data set ###
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
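
The loading code above never closes the two file handles. As a minimal sketch, the same load can be wrapped in a helper that closes the file automatically. The name `load_csv` is hypothetical and not part of the original notebook, and the demo uses a small temporary file so that it runs without the real data sets.

```python
from csv import reader
import os
import tempfile

def load_csv(path, encoding='utf8'):
    """Hypothetical helper: return (header, rows) and close the file afterwards."""
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Demo on a tiny temporary file (made-up content, not the real data set).
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 encoding='utf8', newline='') as tmp:
    tmp.write('App,Price\nExample App,0\n')
    demo_path = tmp.name

demo_header, demo_rows = load_csv(demo_path)
os.remove(demo_path)
print(demo_header, demo_rows)
```

With the real files, the call would look like `android_header, android = load_csv('googleplaystore.csv')`.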

In [22]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [23]:
explore_data(android, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


In [24]:
explore_data(ios, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


In [25]:
print(android_header, "\n", ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In the iOS dataset, I think the most influential variables are user_rating, price, and prime_genre. The variables whose impact will be harder to parse are the ratings per version, device support, and language support.

The Android dataset has fewer columns, but I think the same kinds of variables will offer the most insight: Category, Reviews, and Price.
## Now let's look into the data and see if there are any missing or incorrect values
Our goal is to find the app profiles that appeal to English-speaking users of free apps. We can therefore drop apps aimed at non-English-speaking markets, since their demographic is different.

In [26]:
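# let's write a function to check for inconsistencies in our data
def badEntry(data, headerLen):
    badEntries = {}
    # enumerate gives the true position of each row; data.index(row) returns the
    # first matching row, which is the wrong index whenever a row is duplicated
    for index, row in enumerate(data):
        if len(row) != headerLen:
            badEntries[index] = row
    return badEntries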

In [27]:
# check whether any rows have the wrong number of columns
missingAndroid = badEntry(android, len(android_header))
print(missingAndroid)
# row 10472 is missing its Category value; we could look it up on the Play Store and insert it,
# but for this case we'll just delete it (run this cell only once, or a valid row gets deleted)
del android[10472]
print(len(android))

{10472: ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']}
10840


In [28]:
# now let's check the ios dataset
missingIos = badEntry(ios, len(ios_header))
print(missingIos)

{}


Now we need to look for duplicates. I expect at least a few, since this data includes different versions of apps, and some versions may be recorded as entirely separate apps.

In [29]:
# let's write a function to check for our duplicates
def duplicateDetector(data):
    duplicates = {}
    duplicateList = []
    for row in data:
        name = row[0]
        if name in duplicates:
            duplicates[name] += 1
            duplicateList.append(name)
        else:
            duplicates[name] = 1
    return duplicateList
            

In [30]:
androidDuplicates = duplicateDetector(android)
iosDuplicates = duplicateDetector(ios)
print("Google Play duplicates:", len(androidDuplicates),androidDuplicates[:3], "\n","IOS Duplicates", len(iosDuplicates), iosDuplicates[:3])


Google Play duplicates: 1181 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business'] 
 IOS Duplicates 0 []
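
As a cross-check on the counts above, the same tally can be sketched with `collections.Counter`. The function name `duplicate_counts` and its `name_index` parameter are assumptions added for illustration. The parameter matters because the app name sits in column 0 of the Android data but in column 1 (track_name) of the iOS data. The sample rows are made up.

```python
from collections import Counter

def duplicate_counts(data, name_index=0):
    """Hypothetical helper: map each repeated name to how often it appears."""
    counts = Counter(row[name_index] for row in data)
    return {name: n for name, n in counts.items() if n > 1}

# Made-up sample rows: [name, reviews]
sample = [['Box', '10'], ['Box', '12'], ['Slack', '5'], ['Box', '12']]
dupes = duplicate_counts(sample)
surplus_rows = sum(n - 1 for n in dupes.values())
print(dupes)         # {'Box': 3}
print(surplus_rows)  # 2
```

Here `surplus_rows` corresponds to the length of the list returned by duplicateDetector: every occurrence after the first one.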


As we can see, there are a lot of duplicate entries in the Android dataset and none reported for the iOS dataset. Note that duplicateDetector compares column 0, which is the app name for Android but the id for iOS, so the iOS result only shows that no id is repeated. Now we need a way of choosing which entry to keep. There are a few options:

1. We could keep the most recent version of the app, as that is hopefully the best version yet.
2. We could keep the entry with the most reviews, as that might indicate it is the best version of the app.
3. We could do something more mathematical, like combining all duplicate entries and averaging their review counts.

The third method does offer some benefits, especially if we were planning to use predictive modeling on this dataset. However, I think the simplest and most effective method is option two: keep the entry with the most reviews.

In [31]:
# we're going to do this by making a few functions to help us first
# the first function is going to find apps with the most reviews
def maxReviews(data):
    reviews = {}
    for app in data:
        name = app[0]
        nReviews = float(app[3]) # every CSV field is read as a string, so convert it to a number
        if name in reviews and reviews[name] < nReviews:
            # the name was seen before and this entry has more reviews than the stored maximum
            reviews[name] = nReviews
        elif name not in reviews:
            reviews[name] = nReviews
    return reviews

In [34]:
# now we create the cleaning function using our chosen duplicate removal method
def duplicateRemover(data, maxReviews):
    clean = []
    seen = set()
    # clean holds the rows we keep; seen records the names that were already added
    # we need seen because several entries for the same app may share the same
    # (maximum) number of reviews, and we only want to keep one of them
    for app in data:
        name = app[0]
        reviews = float(app[3])
        if maxReviews[name] == reviews and name not in seen:
            clean.append(app)
            seen.add(name)
    return clean
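
The two helper functions can also be folded into a single pass. The following is a sketch under stated assumptions rather than the notebook's method: `keep_most_reviewed` is a hypothetical name, the column indices default to the Android layout, and the sample rows are made up. It relies on dictionaries preserving insertion order (Python 3.7+).

```python
def keep_most_reviewed(data, name_index=0, reviews_index=3):
    """Hypothetical helper: keep one row per name, the one with the most reviews."""
    best = {}
    for row in data:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # strict '>' means that on a tie the earliest row is kept
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up sample rows: [name, category, rating, reviews]
sample = [
    ['Box', 'BUSINESS', '4.0', '10'],
    ['Slack', 'BUSINESS', '4.5', '5'],
    ['Box', 'BUSINESS', '4.1', '12'],
    ['Box', 'BUSINESS', '4.1', '12'],
]
deduplicated = keep_most_reviewed(sample)
for row in deduplicated:
    print(row)
```

One difference from the two-function version: the result is ordered by the first appearance of each name, not by the position of the row that is kept.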

In [36]:
reviewList = maxReviews(android)
androidClean = duplicateRemover(android, reviewList)
explore_data(androidClean, 0, 3, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


## Apps that don't fit our profile
Our fictional company operates primarily in English-speaking markets, so we want data and insights for apps built to target those demographics. To gear our dataset towards that, we're going to remove any app whose name contains more than three non-ASCII characters, which suggests the name is not English.

In [39]:
# this function iterates through a string and counts the characters outside the ASCII range (code point > 127)
# a name is rejected only if it has more than three such characters,
# so that we avoid dropping apps that include emojis or other symbols in their name
def isEnglish(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    return non_ascii <= 3
# this is not a perfect implementation; something using regex would probably filter more accurately,
# but for this use case it is enough

In [42]:
androidEnglish = []
iosEnglish = []

for app in androidClean:
    name = app[0]
    if isEnglish(name):
        androidEnglish.append(app)
        
for app in ios:
    name = app[1]
    if isEnglish(name):
        iosEnglish.append(app)
        
explore_data(iosEnglish, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


## Remove any paid apps
Our final data cleaning step is to remove any paid apps from our dataset. Our company only makes free apps, so we want the most relevant profile for understanding what makes a great free application.
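
The filter in this step compares the price field against the exact strings '0' and '0.0', so it depends on how each store formats a free price. A slightly more tolerant check can be sketched as follows. The helper name `is_free` is hypothetical, and the assumption that paid prices may carry a leading '$' is mine and should be checked against the data.

```python
def is_free(price_text):
    """Hypothetical helper: True if the price field parses to zero."""
    cleaned = price_text.strip().lstrip('$')  # assumes an optional leading '$'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # anything that is not a number (e.g. a shifted column) is not treated as free
        return False

print(is_free('0'), is_free('0.0'), is_free('$4.99'), is_free('Everyone'))
# True True False False
```

Used in the loops of this step, the condition would become `if is_free(price):` for both stores.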

In [43]:
androidFinal = []
iosFinal = []

for app in androidEnglish:
    price = app[7]
    if price == '0':
        androidFinal.append(app)
        
for app in iosEnglish:
    price = app[4]
    if price == '0.0':
        iosFinal.append(app)
        
explore_data(androidFinal, 0, 3, True)
print("\n")
explore_data(iosFinal, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

This is the end of our data cleaning process. We ended up with 8864 rows in the Android dataset and 3222 rows in the iOS dataset. For our purposes this is enough to do some EDA, and it is large enough for hypothesis testing as well. The only time we may run into issues is if we chose to do predictive modelling with this data, as it may be on the smaller side for building a statistically significant model.
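
To put those row counts in perspective, the short sketch below tabulates how much of each raw dataset survives every cleaning step. All counts are taken from the outputs and text of this notebook; the stage labels are my own wording.

```python
# Row counts reported earlier in this notebook, per cleaning stage.
stages = {
    'android': [('raw', 10841), ('malformed row removed', 10840),
                ('deduplicated', 9659), ('free English apps', 8864)],
    'ios': [('raw', 7197), ('English names', 6183), ('free English apps', 3222)],
}

retained = {}
for store, steps in stages.items():
    raw_count = steps[0][1]
    for label, count in steps:
        print(f'{store:8} {label:24} {count:6d} {count / raw_count:7.1%}')
    # share of the raw rows that is left after the last step
    retained[store] = steps[-1][1] / raw_count
```

Roughly 82% of the Android rows and 45% of the iOS rows remain after cleaning.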

## Analysis
### Apps by Genre
Since we plan on putting this app on both the Google Play store and the App Store, we need to make sure our app caters to both markets. First we will investigate which genres are most common in each market. We'll start by writing a few functions.

In [44]:
def freqTable(data, index):
    table = {}
    total = len(data)
    for row in data:
        value = row[index]
        if value in table:
            table[value] +=1
        else:
            table[value] = 1
    percentTable = {}
    for key in table:
        percent = (table[key] / total) * 100
        percentTable[key] = percent
    return percentTable

def printTable(data, index):
    table = freqTable(data, index)
    displayTable = []
    for key in table:
        keyVal = (table[key], key)
        displayTable.append(keyVal)
    sortedTable = sorted(displayTable, reverse = True)
    for row in sortedTable:
        print(row[1], ":", row[0])
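
As an illustration of what these two functions produce, here is a compact sketch of the same percentage table built with `collections.Counter` and run on made-up rows. The name `freq_table_percent` is hypothetical, and it returns the sorted pairs where printTable prints them.

```python
from collections import Counter

def freq_table_percent(data, index):
    """Hypothetical helper: (percent, value) pairs, largest share first."""
    counts = Counter(row[index] for row in data)
    total = len(data)
    pairs = [(count / total * 100, value) for value, count in counts.items()]
    return sorted(pairs, reverse=True)

# Made-up sample rows: [name, genre]
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_percent(sample, 1)
for percent, value in table:
    print(value, ':', percent)
# Games : 75.0
# Music : 25.0
```

Like printTable, it sorts on (percent, value) tuples, so ties in percentage are broken by the value in reverse alphabetical order.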

In [None]:
# now let's go through the 