## Profiles of Apps that are profitable in the App Store and Google Play Markets

The following project will attempt to analyze data to help App developers better understand what kinds of apps are likely to attract more users. The aim of this project is to find mobile apps that are profitable in the App Store or the Google Play markets.

I am focusing specifically on apps that generate revenue through in-app advertisements rather than through subscriptions or purchases. In other words, this analysis is aimed at companies building applications that are free to download and free to use, with revenue coming from advertisements.

I'm also doing this project without any modules from outside Python's standard library. Pandas or NumPy would likely make it easier, but we're going to rely purely on built-in Python data structures and simple algorithms.

# Discovering the Data

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

![Image](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png)

Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)

Since there are over 4 million apps available in these two stores, we will be using a sample of the data. There are two data sets that we can use for our goals:  
- This [data set from Kaggle](https://www.kaggle.com/lava18/google-play-store-apps/home) contains data on roughly 10,000 Android apps from Google Play, collected in August 2018.  
- This [data set also from Kaggle](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) contains data of ~7,000 iOS apps collected in July 2017.

Let's open the data and save it as a list of lists:

In [1]:
from csv import reader

# opening the data files with proper encoding
openapple = open('AppleStore.csv', encoding="utf8")
opengoogle = open('googleplaystore.csv', encoding="utf8")

# using CSV reader to read the files in
readapple = reader(openapple)
readgoogle = reader(opengoogle)

# using built in List function to convert it into a list of lists
data_apple = list(readapple)
data_google = list(readgoogle)

# Our Final Google Data:
google = data_google[1:]
google_header = data_google[0]

# Our Final Apple Data:
apple = data_apple[1:]
apple_header = data_apple[0]
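
The cell above never closes the two file handles. As a minimal sketch of the same loading step, the helper below (the name `load_csv` is hypothetical, not part of the original notebook) reads a file inside a `with` block so it is closed as soon as the rows are in memory. The demo uses a small temporary file with made-up contents rather than the Kaggle files.

```python
import csv
import os
import tempfile


def load_csv(path):
    """Return (header, rows) for a CSV file, closing the file afterwards."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, encoding="utf8", newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Demo on a small temporary file (hypothetical data, not the Kaggle files)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf8", newline="") as tmp:
    tmp.write("App,Rating\nExample App,4.5\n")
    demo_path = tmp.name

demo_header, demo_rows = load_csv(demo_path)
os.remove(demo_path)
print(demo_header)
print(demo_rows)
```

With the real files, this would presumably be called as `load_csv('AppleStore.csv')` and `load_csv('googleplaystore.csv')`.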

Next, I define a function that will allow us to print the rows in a readable way:

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Let's use the above function to explore the data.

In [3]:
print(apple_header)
print()
explore_data(apple, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


As you can see, iOS store data has 7197 rows of data across 16 columns. Columns of note that might come in handy: `'track_name'` (index=1), `'price'` (index=4), `'rating_count_tot'` (index=5), `'rating_count_ver'` (index=6), `'user_rating'` (index=7), `'user_rating_ver'` (index=8), `'prime_genre'` (index=11)

More details about the column labels can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

Now, let's look at the Google Play Store Data:

In [4]:
explore_data(data_google, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


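The Google store output shows 10842 rows across 13 columns. Because we passed `data_google` (which still contains the header) to the function, that count includes the header row, so there are 10841 apps. Columns of note that might come in handy: `'App'` (index=0), `'Category'` (index=1),  `'Rating'` (index=2), `'Reviews'` (index=3), `'Price'` (index=7), `'Genres'` (index=9)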

More details about the dataset can be found [here](https://www.kaggle.com/lava18/google-play-store-apps/home).

# Data Cleaning p1 - Deleting Outliers


First we will run through all the data and delete any row with a rating above 5.0 or below 0.0, since ratings in both stores range from 0.0 to 5.0.


In [5]:
for i, app in enumerate(apple):
    if not (0 <= float(app[7]) <= 5.0):
        del apple[i]
        
for i, app in enumerate(google):
    if not (0 <= float(app[2]) <= 5.0):
        del google[i]
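
One caveat about the pattern above: deleting from a list while iterating over it shifts the remaining items left, so the item that follows each deleted row is never checked. The sketch below shows the effect on a few hypothetical rating strings and contrasts it with building a new filtered list. It also shows that `float('NaN')` fails any range comparison, so rows with a missing rating would be treated as outliers too, assuming missing ratings are stored as the string `'NaN'`. The row counts reported in this notebook come from the loop as written above.

```python
# Hypothetical ratings: two out-of-range values sit next to each other
ratings = ['4.1', '19', 'NaN', '3.9']

# Pattern used above: delete while iterating
risky = list(ratings)
for i, value in enumerate(risky):
    if not (0 <= float(value) <= 5.0):
        del risky[i]

# Alternative: build a new list that keeps only valid ratings
safe = [value for value in ratings if 0 <= float(value) <= 5.0]

print(risky)  # 'NaN' survives because it slid into the slot just checked
print(safe)
```

Switching the notebook to the second pattern would likely remove more rows than the counts shown below, so the downstream numbers would need to be regenerated.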

Now, let's see if we deleted any outliers by counting the number of rows after the above was finished.

In [6]:
explore_data(apple, 0, 0, True)
explore_data(google, 0, 0, True)

Number of rows: 7197
Number of columns: 16
Number of rows: 9756
Number of columns: 13


The number of rows for the Apple Store remains the same at 7197.
However, the Google store had many rows that failed the check: the number of app rows went from **10841** (the 10842 counted earlier included the header row) to **9756**.

# Data Cleaning p2 - Removing Duplicates

Now, we will loop through the data and find any apps that are not unique (using their app name). For each set of duplicates, we will keep only the entry with the highest number of reviews, since more reviews make the average rating more dependable.

How will we do this? 

First, let's create dictionaries with keys being the name of the apps, and values being the number of times the name of that app exists. That will tell us which apps we'll need to compare.

In [7]:
apple_dict = {}
google_dict = {}

for app in apple:
    if app[1] in apple_dict:
        apple_dict[app[1]] += 1
    else:
        apple_dict[app[1]] = 1

for app in google:
    if app[0] in google_dict:
        google_dict[app[0]] += 1
    else:
        google_dict[app[0]] = 1

Now, let's display the apps that have duplicates:

In [8]:
print('iOS / Apple Duplicates:')
print('-----------------------')

for app in apple_dict:
    if apple_dict.get(app) > 1:
        print(app, apple_dict.get(app))

print()
print('Google Store Duplicates:')
print('-----------------------')

for app in google_dict:
    if google_dict.get(app) > 1:
        print(app, google_dict.get(app))

iOS / Apple Duplicates:
-----------------------
Mannequin Challenge 2
VR Roller Coaster 2

Google Store Duplicates:
-----------------------
Coloring book moana 2
UNICORN - Color By Number & Pixel Art Coloring 2
Textgram - write on photos 2
Wattpad 📖 Free Books 2
Amazon Kindle 2
Dictionary - Merriam-Webster 2
NOOK: Read eBooks & Magazines 2
Oxford Dictionary of English : Free 2
Spanish English Translator 2
NOOK App for NOOK Devices 2
Ebook Reader 2
English Dictionary - Offline 2
Docs To Go™ Free Office Suite 2
Google My Business 3
OfficeSuite : Free Office + PDF Editor 2
Curriculum vitae App CV Builder Free Resume Maker 2
Facebook Pages Manager 2
Box 3
Call Blocker 2
ZOOM Cloud Meetings 2
Facebook Ads Manager 2
Quick PDF Scanner + OCR FREE 3
SignEasy | Sign and Fill PDF and other Documents 2
Genius Scan - PDF Scanner 2
Tiny Scanner - PDF Scanner App 2
Fast Scanner : Free PDF Scan 2
Mobile Doc Scanner (MDScan) Lite 2
TurboScan: scan documents and receipts in PDF 2
Tiny Scanner Pro: PDF D
[... output truncated ...]

Monster High™ 2
Frozen Free Fall 3
Shopkins World! 2
Masha and the Bear Child Games 2
Chess Free 2
Kids Balloon Pop Game Free 🎈 2
Sounds for Toddlers FREE 2
Elmo Calls by Sesame Street 3
Sago Mini Friends 3
Tee and Mo Bath Time Free 2
Bita and the Animals - Pelos Ares 2
TO-FU Oh!SUSHI 2
DreamWorks Friends 2
Avokiddo Emotions 2
Nighty Night Circus 2
Sago Mini Babies 2
Dr. Panda & Toto's Treehouse 3
Fun Kid Racing - Motocross 2
COOKING MAMA Let's Cook! 2
My Little Pony Celebration 2
Animal Jam - Play Wild! 2
Video Editor 3
Real Racing 3 2
Minecraft 2
Monash Uni Low FODMAP Diet 2
iBP Blood Pressure 2
Pedi STAT 2
ASCCP Mobile 2
Journal Club: Medicine 2
Paramedic Protocol Provider 2
Medical ID - In Case of Emergency (ICE) 2
Human Anatomy Atlas 2018: Complete 3D Human Body 3
Essential Anatomy 3 2
Vargo Anesthesia Mega App 2
EMT Review Plus 2
2017 EMRA Antibiotic Guide 2
IBM Micromedex Drug Info 2
Diabetes & Diet Tracker 2
Block Buddy 2
EMT PASS 2
Cardiac diagnosis (heart rate, arrhythmia) 3


As you can see, 2 app names are duplicated in the iOS store data, and far more in the Google store data. The number printed next to each name is how many times it appears, and some names appear more than twice.

Now, let's use these dictionaries (that have the app names along with the number of times the app appears) along with our original data (mutable list of lists) and remove the duplicates.

In [9]:
# creating new blank lists
apple_v2 = []
google_v2 = []

# doing the Apple list first
for app in apple:
    name = app[1]
    
    # if the final list already has this app name in it, do nothing
    if any(name in sublist for sublist in apple_v2):
        pass
    
    # if the name of the app is not unique
    elif apple_dict.get(name) > 1:
        # creating a temp list of duplicates of this app
        duplicates = [app]
        # looping through the data to find duplicates
        for app2 in apple:
            if name == app2[1]:
                duplicates.append(app2)
        # appending to the final list the duplicate with the max reviews (the # of reviews is at index 5)
        # int() makes the comparison numeric; as strings, '999' would rank above '1000'
        apple_v2.append(max(duplicates, key=lambda x: int(x[5])))
    
    # else this app is unique
    else:
        apple_v2.append(app)

        
        
# Now, we run through the Google list
for app in google:
    name = app[0]
    
    # if the final list already has this app name in it, do nothing
    if any(name in sublist for sublist in google_v2):
        pass
    
    # if the name of the app is not unique
    elif google_dict.get(name) > 1:
        # creating a temp list of duplicates of this app
        duplicates = [app]
        # looping through the data to find duplicates
        for app2 in google:
            if name == app2[0]:
                duplicates.append(app2)
        # appending to the final list the duplicate with the max reviews (the # of reviews is at index 3)
        # int() makes the comparison numeric; as strings, '999' would rank above '1000'
        google_v2.append(max(duplicates, key=lambda x: int(x[3])))
    
    # else this app is unique
    else:
        google_v2.append(app)
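
The loops above rescan the whole data set for every duplicated name, and the `name in sublist` test looks at every column of a row rather than only the name column. As a rough alternative, not the approach this notebook's results are based on, the sketch below keeps the most-reviewed row per name in a single pass using a dictionary. The function name `keep_most_reviewed` and the sample rows are hypothetical.

```python
def keep_most_reviewed(dataset, name_index, reviews_index):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> best row seen so far (dicts keep insertion order)
    for row in dataset:
        name = row[name_index]
        reviews = int(row[reviews_index])  # compare as numbers, not strings
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Hypothetical rows in the Google layout: name at index 0, reviews at index 3
sample = [
    ['App A', 'TOOLS', '4.0', '999'],
    ['App B', 'GAME', '4.5', '20'],
    ['App A', 'TOOLS', '4.0', '1000'],
]
deduped = keep_most_reviewed(sample, 0, 3)
print(deduped)
```

Under these assumptions the calls would look like `keep_most_reviewed(apple, 1, 5)` and `keep_most_reviewed(google, 0, 3)`, provided every review count parses as an integer.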

Now, let's compare the number of rows our v2 lists have versus our originals.

In [10]:
print('Before and After Cleaning Duplicate Apps')
print('Apple:')
explore_data(apple, 0, 0, True)
explore_data(apple_v2, 0, 0, True)
print()
print('Google:')
explore_data(google, 0, 0, True)
explore_data(google_v2, 0, 0, True)

Before and After Cleaning Duplicate Apps
Apple:
Number of rows: 7197
Number of columns: 16
Number of rows: 7195
Number of columns: 16

Google:
Number of rows: 9756
Number of columns: 13
Number of rows: 8584
Number of columns: 13


So we can see that there were apps cleaned up - especially from Google. That's good. Now let's check to see if there are any duplicates left over in our new v2 apps by using dictionaries just like before:

In [11]:
apple_dict_v2, google_dict_v2 = {}, {}

for app in apple_v2:
    if app[1] in apple_dict_v2:
        apple_dict_v2[app[1]] += 1
    else:
        apple_dict_v2[app[1]] = 1

for app in google_v2:
    if app[0] in google_dict_v2:
        google_dict_v2[app[0]] += 1
    else:
        google_dict_v2[app[0]] = 1
        
print('iOS / Apple Duplicates:')
print('-----------------------')

for app in apple_dict_v2:
    if apple_dict_v2.get(app) > 1:
        print(app, apple_dict_v2.get(app))

print()
print('Google Store Duplicates:')
print('-----------------------')

for app in google_dict_v2:
    if google_dict_v2.get(app) > 1:
        print(app, google_dict_v2.get(app))

iOS / Apple Duplicates:
-----------------------

Google Store Duplicates:
-----------------------


Great! There are no duplicates left! Now, one final check. If you recall, whenever an app had duplicates, we kept the entry with the most reviews. Let's test that we did in fact do this. In the duplicate listing, `"Viber Messenger"` appeared 5 times. Let's go through our list from before the duplicates were removed, print out the 5 entries, and see how many reviews each had.

In [12]:
for app in google:
    if app[0] == 'Viber Messenger':
        print(app)

['Viber Messenger', 'COMMUNICATION', '4.3', '11334799', 'Varies with device', '500,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 18, 2018', 'Varies with device', 'Varies with device']
['Viber Messenger', 'COMMUNICATION', '4.3', '11334973', 'Varies with device', '500,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 18, 2018', 'Varies with device', 'Varies with device']
['Viber Messenger', 'COMMUNICATION', '4.3', '11334973', 'Varies with device', '500,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 18, 2018', 'Varies with device', 'Varies with device']
['Viber Messenger', 'COMMUNICATION', '4.3', '11335255', 'Varies with device', '500,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 18, 2018', 'Varies with device', 'Varies with device']
['Viber Messenger', 'COMMUNICATION', '4.3', '11335481', 'Varies with device', '500,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 18, 2018', 'Varies with device', 'Varies with device']


If our "v2" cleaned-up list was correct, our final `Viber Messenger` that we kept should be the last one with `11335481` reviews. Let's see:

In [13]:
for app in google_v2:
    if app[0] == 'Viber Messenger':
        print(app)

['Viber Messenger', 'COMMUNICATION', '4.3', '11335481', 'Varies with device', '500,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 18, 2018', 'Varies with device', 'Varies with device']


Great! Now, let's move on.

# Data Cleaning p3 - Removing Non-English Apps

Since we're focusing on apps for English speakers, there's no use in including apps aimed at non-English speakers in our data set. One way to spot such apps is to check whether the app name contains characters that are not commonly used in English text. The numbers corresponding to the characters we commonly use in English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this range, and using the built-in [ord()](https://docs.python.org/3/library/functions.html#ord) function, we can build a function that detects whether a character belongs to the set of common English characters or not.

Let's build and test that function:

In [14]:
def english_string(mystring):
    for letter in mystring:
        if not(0 <= ord(letter) <= 127):
            return False
    return True

print(english_string('Instagram'))
print(english_string('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_string('Docs To Go™ Free Office Suite'))
print(english_string('Instachat 😜'))

True
False
False
False
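
To see why the last two names are rejected even though they read as English, it can help to look at the code points `ord()` returns for individual characters. This is only an illustrative sketch: the characters are taken from the test strings above, and the `code_points` dictionary is a name introduced here.

```python
# Code points of a few characters from the test strings above
code_points = {ch: ord(ch) for ch in ['I', '™', '爱', '😜']}

for ch, point in code_points.items():
    # the last column shows whether the character is inside the ASCII range
    print(repr(ch), point, point <= 127)
```

Only the plain letter falls within 0 to 127. The trademark sign and the emoji fall outside it, just like the Chinese character, so a single symbol is enough for the first version of the function to return `False`.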


Our function is not yet very effective: app names often contain emojis or symbols that fall outside the 0-127 ASCII range, as you can see above, so we would be losing valuable data. Let's therefore relax the filter so that an app is labeled as non-English only if its name has more than 3 characters outside that range.

In [15]:
def english_string(mystring):
    count = 0
    for letter in mystring:
        if not(0 <= ord(letter) <= 127):
            count += 1
    if count > 3:
        return False
    else:
        return True

print(english_string('Instagram'))
print(english_string('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_string('Docs To Go™ Free Office Suite'))
print(english_string('Instachat 😜'))

True
False
True
True


Now, let's filter the two data sets by creating two new data sets (v3), using the above function to remove non-English apps. Then, we'll compare the number of rows in our data (v3) vs our previous iteration of data (v2).

In [16]:
# creating new blank lists
apple_v3 = []
google_v3 = []

for app in apple_v2:
    if english_string(app[1]):
        apple_v3.append(app)

for app in google_v2:
    if english_string(app[0]):
        google_v3.append(app)

print('Before and After Cleaning Non-English Apps')
print('Apple:')
explore_data(apple_v2, 0, 0, True)
explore_data(apple_v3, 0, 0, True)
print()
print('Google:')
explore_data(google_v2, 0, 0, True)
explore_data(google_v3, 0, 0, True)

Before and After Cleaning Non-English Apps
Apple:
Number of rows: 7195
Number of columns: 16
Number of rows: 6181
Number of columns: 16

Google:
Number of rows: 8584
Number of columns: 13
Number of rows: 8550
Number of columns: 13


As we can see, both data sets were cleaned up, but far more non-English apps were removed from the iOS / Apple data than from the Google data.

# Data Cleaning p4 - Removing Non-Free Apps

Recall our original plan was to analyze apps that were free. So let's go ahead and create v4 datasets with only free apps in them.

In [17]:
# creating new blank lists
apple_v4 = []
google_v4 = []

# appending free apps to these blank lists
for app in apple_v3:
    # if price is 0.0
    if float(app[4]) == 0.0:
        apple_v4.append(app)
        
for app in google_v3:
    # using this column in our data was more justifiable than the price column, since the price column had $ signs in some of the strings
    if app[6] == 'Free':
        google_v4.append(app)

print('Before and After Cleaning Non-Free Apps')
print('Apple:')
explore_data(apple_v3, 0, 0, True)
explore_data(apple_v4, 0, 0, True)
print()
print('Google:')
explore_data(google_v3, 0, 0, True)
explore_data(google_v4, 0, 0, True)

Before and After Cleaning Non-Free Apps
Apple:
Number of rows: 6181
Number of columns: 16
Number of rows: 3220
Number of columns: 16

Google:
Number of rows: 8550
Number of columns: 13
Number of rows: 7917
Number of columns: 13


Now, we're left with 3220 iOS apps and 7917 Google apps in our data to analyze.
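
The Google filter above relies on the `Type` column because some price strings carry a `$` sign. If the price column were used instead, a small helper could strip the sign before comparing. This is a sketch under the assumption that paid prices look like `'$4.99'`; the name `is_free` is hypothetical.

```python
def is_free(price_text):
    """True when a price string such as '0', '0.0' or '$0.00' means free."""
    return float(price_text.replace('$', '')) == 0.0


print(is_free('0'), is_free('0.0'), is_free('$4.99'))
```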

# Analyzing the Data p1 - Frequency Tables

We want to analyze the data to understand our market before we build the app. Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. 

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets. Let's do that by focusing on 'prime_genre' (index 11) for the Apple data set and 'Genres' (index 9) and 'Category' (index 1) in the Google data set.

We'll first create two functions: one that creates frequency tables, and then another that sorts our frequency table in DESCENDING order and displays it.

In [18]:
# creating a function that creates frequency tables:
def freq_table(dataset, index):
    '''returns a dictionary mapping each value in the given column to its percentage of all rows'''
    totalapps = len(dataset)
    # named "table" so the dictionary does not shadow the function's own name
    table = {}
    for app in dataset:
        if app[index] in table:
            table[app[index]] += 1
        else:
            table[app[index]] = 1
    # converting the counts into percentages
    for key in table:
        table[key] = round(table[key] / totalapps * 100, 2)
    return table

# this function uses the freq_table function above to create the frequency tables and then displays them in REVERSE SORTED order
def display_table(dataset, index):
    '''the dataset must be a list of lists, the index is an integer that tells us which column we want to focus on'''
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
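
To make the percentages concrete, here is a small worked example on a hypothetical three-row data set. It is a compact variant of the two functions above rather than a replacement for them: the name `genre_percentages` is introduced here, and it sorts the dictionary items directly by value instead of building `(value, key)` tuples.

```python
def genre_percentages(dataset, index):
    """Percentage of rows per distinct value in column `index`, largest first."""
    counts = {}
    for row in dataset:
        counts[row[index]] = counts.get(row[index], 0) + 1
    percentages = {k: round(v / len(dataset) * 100, 2) for k, v in counts.items()}
    # sort the (value, percentage) pairs by percentage, descending
    return sorted(percentages.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical three-row data set with the genre at index 1
toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Music']]
toy_table = genre_percentages(toy, 1)
for genre, pct in toy_table:
    print(genre, ':', pct)
```

Two of the three rows are `Games`, so that genre comes out at 66.67 and `Music` at 33.33.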
        

Analyzing Apple Data set focusing on the **PRIME_GENRE** column

In [19]:
display_table(apple_v4, 11)

Games : 58.14
Entertainment : 7.89
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.52
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.34
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


Analyzing the Google Data set focusing on the **GENRE** column

In [20]:
display_table(google_v4, 9)

Tools : 8.58
Entertainment : 6.06
Education : 5.38
Productivity : 3.81
Lifestyle : 3.8
Finance : 3.79
Business : 3.68
Action : 3.42
Sports : 3.35
Medical : 3.23
Photography : 3.17
Personalization : 3.15
Health & Fitness : 3.07
Communication : 3.07
News & Magazines : 2.69
Social : 2.65
Shopping : 2.31
Travel & Local : 2.3
Simulation : 2.22
Books & Reference : 2.07
Arcade : 1.97
Casual : 1.88
Video Players & Editors : 1.84
Dating : 1.81
Maps & Navigation : 1.48
Food & Drink : 1.25
Puzzle : 1.1
Racing : 1.06
Role Playing : 1.01
Strategy : 0.97
Auto & Vehicles : 0.92
Libraries & Demo : 0.88
Weather : 0.83
House & Home : 0.8
Adventure : 0.75
Art & Design : 0.67
Events : 0.64
Comics : 0.64
Beauty : 0.56
Parenting : 0.48
Card : 0.48
Casino : 0.47
Educational;Education : 0.42
Board : 0.42
Trivia : 0.39
Educational : 0.38
Education;Education : 0.38
Word : 0.29
Casual;Pretend Play : 0.27
Music : 0.21
Puzzle;Brain Games : 0.2
Racing;Action & Adventure : 0.19
Entertainment;Music & Video : 0.19
Cas
[... output truncated ...]

Analyzing the Google Data set focusing on the **CATEGORY** column

In [21]:
display_table(google_v4, 1)

FAMILY : 19.39
GAME : 10.52
TOOLS : 8.59
PRODUCTIVITY : 3.81
LIFESTYLE : 3.81
FINANCE : 3.79
BUSINESS : 3.68
SPORTS : 3.27
MEDICAL : 3.23
PHOTOGRAPHY : 3.17
PERSONALIZATION : 3.15
HEALTH_AND_FITNESS : 3.07
COMMUNICATION : 3.07
NEWS_AND_MAGAZINES : 2.69
SOCIAL : 2.65
TRAVEL_AND_LOCAL : 2.31
SHOPPING : 2.31
BOOKS_AND_REFERENCE : 2.07
VIDEO_PLAYERS : 1.87
DATING : 1.81
MAPS_AND_NAVIGATION : 1.48
EDUCATION : 1.3
FOOD_AND_DRINK : 1.25
ENTERTAINMENT : 1.07
AUTO_AND_VEHICLES : 0.92
LIBRARIES_AND_DEMO : 0.88
WEATHER : 0.83
HOUSE_AND_HOME : 0.8
ART_AND_DESIGN : 0.72
COMICS : 0.66
EVENTS : 0.64
PARENTING : 0.62
BEAUTY : 0.56


As you can see in the Google data set, the difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has many more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

**Up to this point, we find that the iOS App Store is dominated by games and apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps.** These frequency tables tell us how many apps each genre has, NOT how many users those apps have, so moving forward we will look at which genres have the most users.


# Analyzing the Data p2 - Population by Category / Genre

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` (index 5) column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` (index 5) column.

In [36]:
def string_to_int(numstring):
    '''the installs in the google dataset have a "+" sign after along with "," commas and we'll need to use this function to convert these strings to ints'''
    y = ''
    for num in numstring:
        try:
            int(num)
            y += num
        except ValueError:
            pass
    return int(y)
    

# creating new dictionaries that will store average users per category:
avg_users_google, avg_users_apple = {}, {}

# getting the frequency tables by category (only their keys are used in the loops below)
categories_google = freq_table(google_v4, 1)
categories_apple = freq_table(apple_v4, 11)

# looping through the categories/genres and calculating the average number of installs for each category and then assigning it to the newly created dictionaries
# (total number of installs in that category / number of apps in that category)
for category in categories_google:
    installs = 0
    number_apps = 0
    for app in google_v4:
        if app[1] == category:
            # installs is at index 5, we are using our function created above to turn the string into an int
            installs += string_to_int(app[5])
            number_apps += 1
    avg_users_google[category] = round(installs / number_apps, 2)

for category in categories_apple:
    reviews = 0
    number_apps = 0
    for app in apple_v4:
        if app[11] == category:
            reviews += int(app[5])
            number_apps += 1
    avg_users_apple[category] = round(reviews / number_apps, 2)
    
    
# sorting in descending order
# the key is what the sorted() function uses to sort. We are passing a function that returns the item at position 1 of each (category, average) pair, i.e. the average
sorted_by_value_google = sorted(avg_users_google.items(), key=lambda kv: kv[1], reverse = True)
sorted_by_value_apple = sorted(avg_users_apple.items(), key=lambda kv: kv[1], reverse = True)
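
The loops above scan the full data set once per category. As a hedged alternative, the averages can also be accumulated in a single pass with two dictionaries. The sketch below uses hypothetical names (`parse_installs`, `average_by_category`) and made-up rows. It also assumes install strings contain only digits, commas and a trailing `+`, which is what the sample rows shown earlier look like.

```python
def parse_installs(text):
    """Turn a string such as '1,000,000+' into the integer 1000000."""
    return int(text.replace(',', '').replace('+', ''))


def average_by_category(dataset, category_index, value_index, to_number=int):
    """Average a numeric column per category in a single pass over the rows."""
    totals, counts = {}, {}
    for row in dataset:
        category = row[category_index]
        totals[category] = totals.get(category, 0) + to_number(row[value_index])
        counts[category] = counts.get(category, 0) + 1
    return {c: round(totals[c] / counts[c], 2) for c in totals}


# Hypothetical rows: category at index 0, installs at index 1
toy_rows = [['TOOLS', '1,000+'], ['TOOLS', '5,000+'], ['GAME', '10,000+']]
toy_averages = average_by_category(toy_rows, 0, 1, parse_installs)
print(toy_averages)
```

For the real data, the equivalent calls would presumably be `average_by_category(google_v4, 1, 5, parse_installs)` and `average_by_category(apple_v4, 11, 5)`.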

# Google Store Data Analysis

In [40]:
print('GOOGLE STORE Average Users per app stratified by Category - based on # of Installs\n')
for item in sorted_by_value_google:
    print(item[0], ':', item[1])

GOOGLE STORE Average Users per app stratified by Category - based on # of Installs

COMMUNICATION : 45419276.46
VIDEO_PLAYERS : 26565731.08
SOCIAL : 26132614.55
PRODUCTIVITY : 19177522.8
PHOTOGRAPHY : 18515001.24
GAME : 16046473.6
TRAVEL_AND_LOCAL : 15817915.36
TOOLS : 11763240.71
ENTERTAINMENT : 11640705.88
NEWS_AND_MAGAZINES : 11117853.0
BOOKS_AND_REFERENCE : 10156777.56
SHOPPING : 7652095.9
PERSONALIZATION : 6140956.19
WEATHER : 5307378.79
HEALTH_AND_FITNESS : 4705903.95
MAPS_AND_NAVIGATION : 4299595.47
SPORTS : 4228586.26
FAMILY : 4038093.46
BUSINESS : 2394383.08
FOOD_AND_DRINK : 2137555.82
ART_AND_DESIGN : 1986335.09
EDUCATION : 1837378.64
LIFESTYLE : 1647068.33
HOUSE_AND_HOME : 1540827.0
FINANCE : 1517130.87
DATING : 985340.18
COMICS : 767713.46
LIBRARIES_AND_DEMO : 746988.57
AUTO_AND_VEHICLES : 727121.92
PARENTING : 634204.29
BEAUTY : 611961.36
EVENTS : 312761.18
MEDICAL : 146924.42


So we can see that in the Google store, the category with the highest number of users on average is by far `COMMUNICATION`, at roughly 1.7 times the average of the 2nd and 3rd place `VIDEO_PLAYERS` and `SOCIAL` apps. 

Before drawing any conclusions from these averages, we should keep in mind that a few outliers with a huge number of installs may be pulling them up. Let's take a look at the communication apps in the Google store:


In [45]:
for app in google_v4:
    if app[1] == 'COMMUNICATION':
        print(app[0], ':', app[5])

Messenger – Text and Video Chat for Free : 1,000,000,000+
WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+
My Tele2 : 5,000,000+
Firefox Browser fast & private : 100,000,000+
Yahoo Mail – Stay Organized : 100,000,000+
imo beta free calls and text : 100,000,000+
imo free video calls and chat : 500,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Opera Mini - fast web browser : 100,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
Opera Browser: Fast and Secure : 100,000,000+
TracFone My Account : 1,000,000+
Firefox Focus: The privacy browser : 1,000,000+
Google Voice : 10,000,000+
Chrome Dev : 5,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard
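
A handful of apps with a billion or more installs can pull a category average far above what a typical app sees. One way to gauge that effect, sketched here with made-up install counts rather than the real data, is to compare the mean with the median. The `median` helper is written by hand to stay with built-in data structures.

```python
def median(values):
    """Middle value of a list (mean of the two middle values for even lengths)."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical install counts: one giant app among four much smaller ones
installs = [1_000_000_000, 10_000_000, 5_000_000, 5_000_000, 1_000_000]
mean_installs = sum(installs) / len(installs)
median_installs = median(installs)
print(mean_installs, median_installs)
```

In this toy case the mean is about 204 million while the median is 5 million, which shows how strongly a single very large app can skew the average.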

# iOS / Apple Store Data Analysis

In [42]:
print('iOS / APPLE STORE Average Users per app stratified by Genre - based on # of Reviews\n')
for item in sorted_by_value_apple:
    print(item[0], ':', item[1])

iOS / APPLE STORE Average Users per app stratified by Genre - based on # of Reviews

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.5
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.8
Shopping : 26919.69
Health & Fitness : 23298.02
Sports : 23008.9
Games : 22812.9
News : 21248.02
Productivity : 21028.41
Utilities : 18684.46
Lifestyle : 16485.76
Entertainment : 14029.83
Business : 7491.12
Education : 7003.98
Catalogs : 4004.0
Medical : 612.0


In the iOS / Apple Store, using the number of reviews as a proxy for the number of users, `Navigation`, `Reference`, and `Social Networking` have the highest averages.