# Profitable App Profiles for the App Store and Google Play Markets

For free mobile apps, such as those offered on the App Store and Google Play, the main source of revenue is in-app advertising. The number of users downloading and using an app is therefore important, since more users means more potential revenue.

This project uses two data sets to perform an analysis that will help us better understand which apps are likely to attract more users.

The two data sets are:
- A data set of approximately 7,000 iOS mobile device applications from the App Store, which can be downloaded [here](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).
- A data set of approximately 10,000 Android mobile device applications from Google Play, available for download [here](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

  

## Opening and Exploring the Data 

The below code opens the two datasets and converts both the Apple dataset and Android dataset into a list of lists. 

It creates for each:
* a header-only dataset, which lists only the column names (`apple_header` / `google_header`)
* and a content-only dataset (`apple_data` / `google_data`). 



In [2]:
from csv import reader

## Apple Store Data Set ##
opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
apple_data = list(read_file)
apple_header = apple_data[0]
apple_data = apple_data[1:]

## Google Play Data Set ##
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
google_data = list(read_file)
google_header = google_data[0]
google_data = google_data[1:]
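
As an aside, the same loading step can be wrapped in a small helper. This is only a sketch: the name `load_csv` is hypothetical (it is not part of the notebook), and the `utf8` encoding is an assumption about how the files are stored.

```python
from csv import reader

def load_csv(path):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The with-statement closes the file even if reading fails.
    # encoding='utf8' is an assumption about the files' encoding.
    with open(path, encoding='utf8', newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Usage, assuming the two CSV files sit next to the notebook:
# apple_header, apple_data = load_csv('AppleStore.csv')
# google_header, google_data = load_csv('googleplaystore.csv')
```

Compared with the cell above, this closes each file automatically and avoids reusing the `opened_file` and `read_file` names for both data sets.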




The following function lets us explore the two datasets more fully. It takes as input a dataset, start and end indices that define the slice of the dataset you want to see, and a flag that specifies whether to display the number of rows and columns.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Here, I am using the `explore_data` function to examine, first, part of the `apple_data` dataset, and following that, part of the `google_data` dataset.

In [4]:
explore_data(apple_data,0,5,True)






['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


In [5]:
explore_data(google_data, 0, 5,True)



['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


## Deleting Wrong Data

Row 10472 in `google_data` has a column missing: the `Category` value is absent, so every later value is shifted one position to the left.
The row is printed below and then deleted:

In [6]:
print(google_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [7]:
del google_data[10472]
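
A row like this can also be found programmatically by comparing each row's length with the header's length. The sketch below is illustrative only: `find_bad_rows` is a hypothetical helper, and the sample rows are placeholders rather than values from the data set.

```python
def find_bad_rows(header, dataset):
    """Return the indices of rows whose length differs from the header's."""
    bad = []
    for i, row in enumerate(dataset):
        if len(row) != len(header):
            bad.append(i)
    return bad

# Tiny illustration with a three-column header (placeholder values).
sample_header = ['App', 'Category', 'Rating']
sample_data = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # one value is missing in this row
    ['App C', 'GAME', '4.5'],
]
print(find_bad_rows(sample_header, sample_data))  # [1]
```

If several rows were flagged, deleting them from the highest index down would keep the remaining indices valid.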

After deleting row 10472, the total length is re-counted, to check that one row has been removed:

In [8]:
explore_data(google_data, 0, 5, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10840
Number of columns: 13


## Removing Duplicate Entries (i)

The Google Play data set has duplicate entries.
For example, Instagram has four row entries in the data set.
The below code uses a `for` loop to print every entry for `Instagram`:

In [9]:
for app in google_data:
    app_name = app[0]
    if app_name == 'Instagram':
        print(app)


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


The below code identifies the duplicate apps in the Google Play data set.

It does this by:
- creating two empty lists: `google_unique` and `google_duplicate`.
- looping through `google_data` using `app_name`
- checking if `app_name` is present in list `google_unique`
- if `app_name` is already present, it adds `app_name` to `google_duplicate`.
- if `app_name` is not already present in `google_unique`, it adds the app to `google_unique` list. 

It then: 
- uses the `len()` function to print the number of apps in `google_duplicate`. 
- uses the `len()` function to print the number of unique apps
- uses the `print()` function to display the first 20 rows in `google_duplicate`.

In [10]:
google_unique = []
google_duplicate = []

for app in google_data:
    app_name = app[0]
    
    if app_name in google_unique:
        google_duplicate.append(app_name)
    else:
        google_unique.append(app_name)


print('Number of Duplicate Apps:', len(google_duplicate))
print('Number of Unique Apps:',len(google_unique))
print('\n')
print(google_duplicate[:20])


Number of Duplicate Apps: 1181
Number of Unique Apps: 9659


['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


There are 1,181 duplicates in the Google Play data set. These entries will need to be removed so as not to distort the analysis.

Looking at the four entries for Instagram above, it becomes clear that the duplicates differ only in the number of reviews (column 4, `Reviews`).

I am going to retain the entry with the highest number of reviews, since this suggests that it is the most up-to-date entry. For our purpose of looking at user ratings, the more reviews available, the better.

This means that I will need to delete the remaining duplicate entries from the data set. 

## Removing Duplicate Entries (ii)

I decided to retain the row with the highest number of reviews and to delete the duplicate entries.
As a first step, I'm going to identify the maximum number of reviews recorded for each app and store these values in a dictionary.
In order to do this:
- I created an empty dictionary called `reviews_max`.
- Inside the loop, I created a variable `name` for the app name and another, `n_reviews`, for the number of reviews.
- I looped through `google_data`. If `name` was already in the dictionary and `n_reviews` was greater than the stored value, the value was updated. If `name` was not in the dictionary, the key-value pair (`name`: `n_reviews`) was inserted.

The dictionary therefore contained each app name paired with its maximum number of reviews in the Google data set.

I used the len() function to confirm that the number of key-value pairs in the dictionary matched the expected number (9659 rows). 


In [11]:
reviews_max = {}

for row in google_data:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

len(reviews_max)
        




9659

Next, I used the data stored in the dictionary to create a new, clean data set (`android_clean`). 

- I created two empty lists, `android_clean` and `already_added`.
- For each row in `google_data`, I assigned the app name to `name` and the number of reviews to `n_reviews`.
- I looped through `google_data`. If `n_reviews` matched the value stored in the `reviews_max` dictionary and `name` was not yet in the `already_added` list, the whole row was added to `android_clean` and the name was added to `already_added`.


In [12]:
android_clean = []
already_added = []

for row in google_data:
    name = row[0]
    n_reviews = float(row[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)

print(len(android_clean))
print(android_clean[0:5])
print('\n')
print(already_added[0:5])
    

9659
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]


['Photo Editor & Candy Camera & Grid & ScrapBook', 'U Launcher Lite – FREE Live Cool The
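
The two-step approach above can also be written as a single pass that stores the best row per app directly. This is a sketch of an alternative, not the method used in this notebook; `dedupe_by_max_reviews` is a hypothetical name, and the sample rows are the four Instagram entries shown earlier, shortened to their first four columns.

```python
def dedupe_by_max_reviews(dataset, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in dataset:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' means the first row wins when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# The four Instagram rows from above, shortened to four columns.
instagram_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
print(dedupe_by_max_reviews(instagram_rows))
# [['Instagram', 'SOCIAL', '4.5', '66577446']]
```

Because membership tests on a dictionary do not scan a list, this avoids the repeated `name not in already_added` lookups. The order of the returned rows may differ from `android_clean`.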

## Removing Non-English Apps (i):

Some of the apps in the data set are not aimed at an English-speaking audience. Since we are only interested in English-language apps for our analysis, it is desirable to identify and remove any non-English apps from the data set.

The below function is designed to detect whether a particular string contains a non-English-language character.
It:
- loops over each character in the string
- determines whether the character has an ordinal (Unicode code point) greater than 127; the range 0-127 is the ASCII range, which covers the characters commonly used in English text.
- if it does, the function returns `False`; otherwise it returns `True`.

In [13]:
def eng_check(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True
  

In [14]:
print(eng_check('Instagram'))
print(eng_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_check('Docs To Go™ Free Office Suite'))
print(eng_check('Instachat 😜'))

True
False
False
False
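
To see why the last two names fail the check, it can help to look at the ordinals directly. The sketch below is only illustrative; `is_ascii_only` is a hypothetical helper that behaves like the first version of `eng_check`.

```python
def is_ascii_only(string):
    """True if every character's code point is in the ASCII range 0-127."""
    return all(ord(character) <= 127 for character in string)

# Print the code point of a few characters taken from the app names above.
for character in ['a', 'A', '5', '+', '™', '😜', '爱']:
    print(repr(character), ord(character))

print(is_ascii_only('Instagram'))                      # True
print(is_ascii_only('Docs To Go™ Free Office Suite'))  # False
```

The trademark sign has ordinal 8482 and the emoji 128540, so a single such character is enough to reject an otherwise English name. In Python 3.7 and later, the built-in `str.isascii()` method performs the same test.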


Some English-language app names contain characters that fall outside of the 0-127 range (such as emojis or the ™ symbol). Since we don't want these apps to be excluded from our analysis, the below modified function allows up to 3 such characters in the string. It:

- starts with a counter set to 0
- loops through each character of the string
- if the character falls outside of the 0-127 range, it adds 1 to the counter.
- if the counter is greater than 3, the function returns `False` (non-English app); otherwise it returns `True`.

In [15]:
def eng_check(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True

In [16]:
eng_check('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
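
For comparison, the same rule can be expressed more compactly with `sum()`. This is a sketch under the same threshold of 3; the name `eng_check_compact` and the `max_non_ascii` parameter are hypothetical and not used elsewhere in the notebook.

```python
def eng_check_compact(string, max_non_ascii=3):
    """True if the string has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(eng_check_compact('Docs To Go™ Free Office Suite'))  # True
print(eng_check_compact('Instachat 😜'))                   # True
print(eng_check_compact('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
```

The two names that the first version rejected now pass, while the Chinese-language title is still filtered out. The threshold remains a heuristic, so a few non-English apps with short names may still slip through.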

In [17]:
google_eng = []
apple_eng = []

for app in android_clean:
    name = app[0]
    if eng_check(name):
        google_eng.append(app)

for app in apple_data:
    name = app[1]
    if eng_check(name):
        apple_eng.append(app)

print(google_eng[0:5])
print('Length google_eng: ',len(google_eng))
print('\n')
print(apple_eng[0:5])
print('Length apple_eng: ',len(apple_eng))
            

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]
Length google_eng:  9614


[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '21

## Isolating Free Apps

Both data sets specify whether or not the app is free. 

For the apple data set, this is column 5 (index = 4).

For the google data set, this is column 8 (index = 7).

In [18]:
print(apple_header)
print(google_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The below code isolates only the free Android apps and the free Apple apps. 
It:
- creates two empty lists (`google_free` and `apple_free`)
- loops through the cleaned google data set, and where `price == '0'`, adds the app to `google_free`
- loops through the cleaned apple data set, and where `price == '0.0'`, adds the app to `apple_free`
- finally, it prints a sample from both lists, and the final count of each. 

NB: the Android data lists free apps as '0', whereas the Apple data set lists free apps as '0.0'

In [19]:
google_free = []
apple_free = []

for app in google_eng:
    price = app[7]
    if price == '0':
        google_free.append(app)

for app in apple_eng:
    price = app[4]
    if price == '0.0':
        apple_free.append(app)
        
print(google_free[0:5])
print('Number of free google apps:' , len(google_free))
print(apple_free[0:5])
print('Number of free apple apps: ',len(apple_free))
        

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]
Number of free google apps: 8864
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676
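
Because the two stores write a zero price differently (`'0'` and `'0.0'`), a numeric comparison could replace the two string comparisons. The sketch below is an assumption-laden alternative: `is_free` is a hypothetical helper, and it assumes that every price is a numeric string that may carry a leading `$`.

```python
def is_free(price):
    """True if a price string represents zero.

    Assumes numeric strings, optionally prefixed with '$';
    anything else raises ValueError.
    """
    cleaned = price.replace('$', '').strip()
    return float(cleaned) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

With such a helper, the same condition could be used for both loops, at the cost of failing loudly on any price value that is not numeric.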

## Most Common Apps by Genre (i)

The aim of the analysis is to determine which types of app are likely to attract the most users, since revenue from in-app advertising is influenced by the number of people using the app. 

The validation strategy for the app's development is to:
- build an Android version of the app
- release on Google Play
- if there is a good response, develop the app further
- if the app becomes profitable after 6 months, build an iOS version of the app, to be added to the App Store. 

This analysis therefore seeks to find out which types of apps available from both app stores are the most common and the most popular. 

If we explore the cleaned Android dataset of free apps `google_free`, we could use several data points in our analysis:

- Category
- Rating
- Reviews
- Installs
- Genres

In [20]:
print(google_header)
explore_data(google_free, 200, 210,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Easy Installer - Apps On SD', 'BUSINESS', '4.1', '23055', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'June 8, 2018', 'Varies with device', 'Varies with device']


['IndiaMART: Search Products, Buy, Sell & Trade', 'BUSINESS', '4.5', '207372', '11M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', '12.2.4', '4.0 and up']


['ViettelPost express delivery', 'BUSINESS', '4.4', '1225', '25M', '100,000+', 'Free', '0', 'Everyone', 'Business', 'August 1, 2018', '1.0.6.8', '4.1 and up']


['MyASUS - Service Center', 'BUSINESS', '4.4', '380837', '7.3M', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 4, 2018', '3.4.5', '4.2 and up']


['Job Korea - Career Jobs', 'BUSINESS', '4.3', '10600', '6.5M', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 1, 2018', '2.5.6', '4.0 and up

Similarly, if we explore the cleaned Apple data set of free apps, `apple_free`, we could use the following data points in our analysis:

- rating_count_tot
- user_rating
- prime_genre


In [21]:
print (apple_header)
print('\n')
explore_data(apple_free, 200, 210, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['696565994', 'Shadow Fight 2', '144713728', 'USD', '0.0', '99206', '1876', '4.5', '4.5', '1.9.29', '12+', 'Games', '40', '5', '1', '1']


['339440515', 'Voice Changer Plus', '40722432', 'USD', '0.0', '98777', '233', '3.0', '4.0', '5.01', '4+', 'Entertainment', '37', '2', '12', '1']


['429610587', 'iFunny :)', '66599936', 'USD', '0.0', '98344', '877', '3.5', '4.5', '4.6.8', '17+', 'Entertainment', '37', '2', '2', '1']


['491730359', 'The CW', '30552064', 'USD', '0.0', '97368', '9', '4.5', '3.5', 'v2.13.9', '12+', 'Entertainment', '37', '5', '9', '1']


['806077016', 'War Robots', '600535040', 'USD', '0.0', '97122', '380', '4.5', '4.5', '2.9.0', '12+', 'Games', '38', '5', '17', '1']


['372648912', 'MeetMe - Chat and Meet New People', '133956608', 'USD', '0.0

In terms of identifying the most common genres, the following data points would be most useful:

Android:

- `Category`
- `Genres`

Apple:

- `prime_genre`

## Most Common Apps by Genre (ii):

The below code defines a function that calculates the frequency of a particular column (in this case, we are looking for the genre of the app) and displays the frequency as a percentage. 

To do this, it creates two dictionaries.

The first is filled by a `for` loop that counts how many times each value (genre) appears in the dataset, storing the counts in the `value_freq` dictionary. The same loop also counts the total number of rows in the dataset.

The second is filled by a `for` loop that converts each count into a percentage of the total and stores it in the new dictionary `freq_percentages`.

The function then returns all the values as percentages. 

In [22]:
def freq_table(dataset, index):
    value_freq = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in value_freq:
            value_freq[value] += 1
        else:
            value_freq[value] = 1
    
    freq_percentages = {}
    
    for key in value_freq:
        percentage = (value_freq[key] / total) * 100
        freq_percentages[key] = percentage
        
    
    return freq_percentages
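
The counting step could also be delegated to `collections.Counter` from the standard library. This is a sketch of an equivalent formulation, not the function used in the rest of this notebook; `freq_table_counter` is a hypothetical name and the sample rows are placeholders.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

# Placeholder rows: three apps in one genre, one in another.
sample_rows = [
    ['app 1', 'Games'],
    ['app 2', 'Games'],
    ['app 3', 'Music'],
    ['app 4', 'Games'],
]
print(freq_table_counter(sample_rows, 1))  # {'Games': 75.0, 'Music': 25.0}
```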

 


In [23]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

Using the helper function `display_table`, defined above, we can view the percentage of apps in each genre for each dataset, ordered from most common to least common. (NB: `display_table` converts the dictionary of percentages into a list of `(percentage, genre)` tuples, which is then sorted from high to low.)
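
The tuple conversion is needed because sorting a dictionary directly would sort its keys. An alternative, sketched below, is to sort the dictionary's items with a `key` function; `sorted_items` is a hypothetical helper, and the sample percentages are three App Store values from this notebook rounded to two decimals.

```python
def sorted_items(table):
    """Return (key, value) pairs sorted by value, highest first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Three rounded percentages taken from the App Store results.
sample_table = {'Entertainment': 7.88, 'Games': 58.16, 'Education': 3.66}
for genre, percentage in sorted_items(sample_table):
    print(genre, ':', percentage)
```

This prints Games first, then Entertainment, then Education, without building the intermediate list by hand.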

Here are the different genres featuring in the clean Apple dataset using the `prime_genre` column, ranked from highest to lowest.

In [24]:
display_table(apple_free,11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Here are the different genres featuring in the clean Android dataset using the `Genres` column, ranked from highest to lowest:

In [25]:
display_table(google_free,9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

Here are the different categories featuring in the clean Android dataset using the `Category` column, ranked from highest to lowest. Of the two columns, this one is more general than `Genres` and gives a clearer picture of the different categories.

In [26]:
display_table(google_free,1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

# Most Common Apps by Genre (iii):

## Analysis (i): The App Store

*What is the most common genre? What is the runner-up?*

By a large margin, the most common genre among free apps on the App Store is "Games" (58%). The runner-up is "Entertainment", at 8%.

*What other patterns do you see?*

Hobby-related genres (social networking, photo and video, music, sports) each account for roughly 2-5% of the free apps, which is still very low compared to Games. Information-based apps (news, weather, finance, business) feature even less. However, it could be that these apps are more commonly paid for on the App Store.

*What is the general impression: are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?*

Generally, the free apps tend to be for entertainment purposes (games, social networking, photo, sports, etc.).

*Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?*

It's difficult for a few reasons:
- While gaming apps are the most common by a long way, our app would have to distinguish itself from the large number of other free gaming apps.
- We would have to re-examine the data to see what the average number of users is for these gaming apps. They may be the most common, but we can't tell from this table whether they have the most users.
- Is the App Store dataset the most appropriate for researching free apps? It could be that the majority of its customers are accustomed to paying for apps.


## Analysis (ii): Google Play

*What are the most common genres?*

The most common categories are "Family" (19%), "Games" (10%), and "Tools" (8%).

*What other patterns do you see?*

*Compare the patterns you see for the Google Play market with those you saw for the App Store market.*

The two lists are not directly comparable. It's not clear what some Google Play genres represent (for example, "Family"), or which App Store genre would be their counterpart.
On Google Play, genres like Business, Finance and Medical account for a higher share of apps, rather than sitting at the bottom of the ranking as they do on the App Store.


*Can you recommend an app profile based on what you found so far? Do the frequency tables you generated reveal the most frequent app genres or what genres have the most users?*

It would be difficult to recommend an app based on this data alone. We would also have to examine number of users. 

## Most Popular Apps by Genre on the App Store

The analysis above showed which genre of apps was most common in each data set. However, it did not tell us how many users were using the apps (e.g. it could be that Games is the most common genre but has the fewest users). It would be useful to know the average number of users per genre, so that we could get a clearer idea of which genre is the most popular.

Since the App Store data set doesn't record number of installs of an app, we will use the next best thing to analyse this data set: the number of user ratings per app, in order to calculate the average number of user ratings per genre. 

To do this, I needed to: 

- Count the number of user ratings per app 
- Add this to a total (for one particular genre)
- Count the number of apps belonging to a particular genre. 
- Divide the total number of user ratings per genre by the number of apps in that genre. 

I used a nested loop to do this. 

- First I got a complete list of genres (using the `freq_table` function).

For loop 1:
I looped through the complete set of genres. 
- I set `total` (the number of user ratings for the whole genre) at 0
- I set `len_genre` (number of apps per genre) at 0. 

For loop 2:
I looped through the apple dataset and every time the genre (`row[11]`) matched genre in the initial set, I:
- recorded the number of user ratings 
- added the number of user ratings to the total for that genre
- added 1 to `len_genre` (to count the number of apps in the genre) 

Then, under for loop 1, I divided total / len_genre to get the average number of user ratings for a particular genre. 

I printed each genre, together with the average number of user ratings for the genre, as well as the number of apps per genre. 


In [27]:
apple_genres = freq_table(apple_free,11)
    
for genre in apple_genres:
    total = 0 
    len_genre = 0
    for row in apple_free:
        genre_app = row[11]
        if genre_app == genre:
            num_ratings = float(row[5])
            total += num_ratings
            len_genre += 1
    genre_avg = total / len_genre
    print(genre,": ",genre_avg, "genre_count:", len_genre)


Social Networking :  71548.34905660378 genre_count: 106
Photo & Video :  28441.54375 genre_count: 160
Games :  22788.6696905016 genre_count: 1874
Music :  57326.530303030304 genre_count: 66
Reference :  74942.11111111111 genre_count: 18
Health & Fitness :  23298.015384615384 genre_count: 65
Weather :  52279.892857142855 genre_count: 28
Utilities :  18684.456790123455 genre_count: 81
Travel :  28243.8 genre_count: 40
Shopping :  26919.690476190477 genre_count: 84
News :  21248.023255813954 genre_count: 43
Navigation :  86090.33333333333 genre_count: 6
Lifestyle :  16485.764705882353 genre_count: 51
Entertainment :  14029.830708661417 genre_count: 254
Food & Drink :  33333.92307692308 genre_count: 26
Sports :  23008.898550724636 genre_count: 69
Book :  39758.5 genre_count: 14
Finance :  31467.944444444445 genre_count: 36
Education :  7003.983050847458 genre_count: 118
Productivity :  21028.410714285714 genre_count: 56
Business :  7491.117647058823 genre_count: 17
Catalogs :  4004.0 genre

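### Analysis
Navigation has the highest average number of user ratings per app among the free apps in the App Store dataset, which makes it the most popular genre by this measure. 
Reference is the second most popular genre, followed closely by Social Networking. 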

I was surprised by these results. Navigation only has six apps, yet its average number of user ratings is very high. 

It may be tempting at this point to recommend a navigation-based app for the App Store, since this type of free app seems to encourage a popular response from users. However, we need to take a closer look at whether one particular app in this genre is far more popular than the rest, or whether navigation apps generally generate a popular response. 
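
A closer look of that kind could be sketched as follows. This is a minimal sketch, not part of the original analysis: the helper names `apps_in_genre`, `top_share` and `make_row` are hypothetical, the app name is assumed to sit in column 1, and the sample rows are made up purely so that the example runs on its own. In the notebook, `apple_free` would be passed in place of `sample_rows`.

```python
def apps_in_genre(dataset, genre, name_idx=1, ratings_idx=5, genre_idx=11):
    """Return (name, rating_count) pairs for one genre, largest first."""
    matches = []
    for row in dataset:
        if row[genre_idx] == genre:
            matches.append((row[name_idx], float(row[ratings_idx])))
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


def top_share(pairs, n=1):
    """Fraction of all ratings held by the first n apps in `pairs`.

    Expects `pairs` sorted largest first, as returned by apps_in_genre.
    """
    total = sum(count for _, count in pairs)
    if total == 0:
        return 0.0
    return sum(count for _, count in pairs[:n]) / total


def make_row(name, ratings, genre):
    """Build a 12-column row with only the columns used here filled in."""
    row = [''] * 12
    row[1], row[5], row[11] = name, str(ratings), genre
    return row


# Hypothetical sample data (not taken from AppleStore.csv)
sample_rows = [
    make_row('Nav App A', 900, 'Navigation'),
    make_row('Nav App B', 100, 'Navigation'),
    make_row('Game X', 500, 'Games'),
]

navigation = apps_in_genre(sample_rows, 'Navigation')
for name, count in navigation:
    print(name, ':', count)
print('Share held by the top app:', top_share(navigation, n=1))
```

A share close to 1 would suggest that the genre average is driven by one or two apps rather than by the genre as a whole.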

## Most Popular Apps by Genre on Google Play

The next step is to perform the same analysis of the Google Play data set, searching for the most popular genre of app, this time based on the number of installs of a particular app. 

The Google Play dataset doesn't supply a precise number of installs for each app. Instead, it presents open-ended values (100+, 5,000+, 100,000,000+). 

However, we can still use these figures to approximate the number of installs (e.g. we can simply treat 100+ as 100 and 5,000+ as 5,000) and get a good impression of popularity. 
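
The conversion can be sketched as a small helper. The name `parse_installs` is hypothetical and not part of the original notebook. The resulting number is a lower bound on the real install count, since '5,000+' only tells us that the app has at least 5,000 installs.

```python
def parse_installs(value):
    """Convert an installs string such as '5,000+' to a float (5000.0)."""
    cleaned = value.replace('+', '').replace(',', '')
    return float(cleaned)


# The three example values mentioned in the text
for raw in ['100+', '5,000+', '100,000,000+']:
    print(raw, '->', parse_installs(raw))
```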

In order to do this, we will need to repeat the same nested loop code as above, but with additional steps to convert the data strings to floats. 

To calculate the average number of installs for a particular category of app, I had to:

- Get the (approximate) number of installs for each app.
- Add this to a total (for one particular category).
- Count the number of apps belonging to a particular category.
- Divide the total number of installs per category by the number of apps in that category.

Using a nested loop, I:

- First got a complete list of the Google Play categories (using the `freq_table` function).

For loop 1: 
- I looped through the complete set of categories.
- I set `total` (which will be the number of installs for the whole category) at 0.
- I set `len_category` (number of apps per category) at 0.

For loop 2: 
- I looped through the Google Play (clean/free) dataset.
- Every time the category (`row[1]`) matched the current category from loop 1, I:
  - recorded the installs value (a string)
  - used `str.replace(old, new)` to remove the '+' and ',' characters
  - converted the value to a float
  - added this number to `total`
  - added 1 to `len_category` (to count the number of apps in the category)

Then, back in for loop 1, I divided `total` by `len_category` to get the average number of installs for a particular category.

I printed each category, together with the average number of installs for the category, as well as the number of apps per category.


In [29]:
# Iterating over the frequency table yields each category once
google_category = freq_table(google_free, 1)

for category in google_category:
    total = 0          # sum of (approximate) installs for this category
    len_category = 0   # number of apps in this category
    for row in google_free:
        category_app = row[1]
        if category_app == category:
            n_installs = row[5]
            # Strip '+' and ',' so that e.g. '5,000+' becomes '5000'
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_installs = total/len_category
    print(category, ':', avg_installs, ':', len_category)
    

ART_AND_DESIGN : 1986335.0877192982 : 57
AUTO_AND_VEHICLES : 647317.8170731707 : 82
BEAUTY : 513151.88679245283 : 53
BOOKS_AND_REFERENCE : 8767811.894736841 : 190
BUSINESS : 1712290.1474201474 : 407
COMICS : 817657.2727272727 : 55
COMMUNICATION : 38456119.167247385 : 287
DATING : 854028.8303030303 : 165
EDUCATION : 1833495.145631068 : 103
ENTERTAINMENT : 11640705.88235294 : 85
EVENTS : 253542.22222222222 : 63
FINANCE : 1387692.475609756 : 328
FOOD_AND_DRINK : 1924897.7363636363 : 110
HEALTH_AND_FITNESS : 4188821.9853479853 : 273
HOUSE_AND_HOME : 1331540.5616438356 : 73
LIBRARIES_AND_DEMO : 638503.734939759 : 83
LIFESTYLE : 1437816.2687861272 : 346
GAME : 15588015.603248259 : 862
FAMILY : 3695641.8198090694 : 1676
MEDICAL : 120550.61980830671 : 313
SOCIAL : 23253652.127118643 : 236
SHOPPING : 7036877.311557789 : 199
PHOTOGRAPHY : 17840110.40229885 : 261
SPORTS : 3638640.1428571427 : 301
TRAVEL_AND_LOCAL : 13984077.710144928 : 207
TOOLS : 10801391.298666667 : 750
PERSONALIZATION : 520148

### Analysis:

The most popular categories of free apps on Google Play are: 
- Communication
- Video Players
- Social
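
Rather than scanning the printed output by eye, the ranking could be read off by sorting the averages. The following is a minimal sketch under stated assumptions: the function names `average_by_category` and `rank_categories` are hypothetical, the averages are computed in a single pass with dictionaries instead of the nested loop used above, and the sample rows are invented so that the example runs on its own. In the notebook, `google_free` would be passed in place of `sample_rows`.

```python
def average_by_category(dataset, category_idx=1, installs_idx=5):
    """Return {category: average installs} using one pass over the data."""
    totals = {}
    counts = {}
    for row in dataset:
        category = row[category_idx]
        installs = float(row[installs_idx].replace('+', '').replace(',', ''))
        totals[category] = totals.get(category, 0.0) + installs
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}


def rank_categories(averages):
    """Return (category, average) pairs sorted from highest to lowest average."""
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample rows (not taken from googleplaystore.csv);
# only column 1 (category) and column 5 (installs) are used here.
sample_rows = [
    ['App A', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['App B', 'COMMUNICATION', '', '', '', '500,000+'],
    ['App C', 'SHOPPING', '', '', '', '10,000+'],
    ['App D', 'TOOLS', '', '', '', '100,000+'],
]

ranking = rank_categories(average_by_category(sample_rows))
for category, avg in ranking:
    print(category, ':', avg)
```

The single-pass version avoids re-reading the whole dataset once per category, which matters little at this data size but keeps the ranking step separate from the averaging step.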

The Communication category generally represents messaging and email apps. It is dominated by apps such as WhatsApp and Gmail:


In [36]:
comms_list = []

for row in google_free:
    category = row[1]
    name = row[0]
    installs = row[5]
    if category == 'COMMUNICATION':
        comms_list.append(name)
        comms_list.append(installs)
print(comms_list)

['WhatsApp Messenger', '1,000,000,000+', 'Messenger for SMS', '10,000,000+', 'My Tele2', '5,000,000+', 'imo beta free calls and text', '100,000,000+', 'Contacts', '50,000,000+', 'Call Free – Free Call', '5,000,000+', 'Web Browser & Explorer', '5,000,000+', 'Browser 4G', '10,000,000+', 'MegaFon Dashboard', '10,000,000+', 'ZenUI Dialer & Contacts', '10,000,000+', 'Cricket Visual Voicemail', '10,000,000+', 'TracFone My Account', '1,000,000+', 'Xperia Link™', '10,000,000+', 'TouchPal Keyboard - Fun Emoji & Android Keyboard', '10,000,000+', 'Skype Lite - Free Video Call & Chat', '5,000,000+', 'My magenta', '1,000,000+', 'Android Messages', '100,000,000+', 'Google Duo - High Quality Video Calls', '500,000,000+', 'Seznam.cz', '1,000,000+', 'Antillean Gold Telegram (original version)', '100,000+', 'AT&T Visual Voicemail', '10,000,000+', 'GMX Mail', '10,000,000+', 'Omlet Chat', '10,000,000+', 'My Vodacom SA', '5,000,000+', 'Microsoft Edge', '5,000,000+', 'Messenger – Text and Video Chat for Fre
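
One way to make the dominance of a few apps easier to see than in the flat list above is to count how many apps fall into each installs tier. This is only a sketch: the helper names `to_number` and `installs_distribution` are hypothetical, and `sample_rows` holds just a handful of name and installs pairs copied from the output above, with the unused columns left empty. In the notebook, `google_free` would be passed instead.

```python
def to_number(installs):
    """Convert an installs string such as '5,000,000+' to a float."""
    return float(installs.replace('+', '').replace(',', ''))


def installs_distribution(dataset, category, category_idx=1, installs_idx=5):
    """Return (installs tier, app count) pairs for one category, highest tier first."""
    tiers = {}
    for row in dataset:
        if row[category_idx] == category:
            tier = row[installs_idx]
            tiers[tier] = tiers.get(tier, 0) + 1
    return sorted(tiers.items(), key=lambda item: to_number(item[0]), reverse=True)


# A few entries from the printed output; only columns 0, 1 and 5 are filled in
sample_rows = [
    ['WhatsApp Messenger', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Messenger for SMS', 'COMMUNICATION', '', '', '', '10,000,000+'],
    ['My Tele2', 'COMMUNICATION', '', '', '', '5,000,000+'],
    ['imo beta free calls and text', 'COMMUNICATION', '', '', '', '100,000,000+'],
    ['Contacts', 'COMMUNICATION', '', '', '', '50,000,000+'],
    ['Browser 4G', 'COMMUNICATION', '', '', '', '10,000,000+'],
]

distribution = installs_distribution(sample_rows, 'COMMUNICATION')
for tier, count in distribution:
    print(tier, ':', count)
```

If only a small number of apps sit in the top tiers while most sit far below, the category average says more about those few apps than about a typical app in the category.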

A category that is less dominated by a few major players is Shopping:

In [58]:
for app in google_free:
    if app[1] == 'SHOPPING':
        print(app[0], ':', app[5])

Amazon for Tablets : 10,000,000+
OfferUp - Buy. Sell. Offer Up : 10,000,000+
Shopee - No. 1 Online Shopping : 10,000,000+
Shopee: No.1 Online Shopping : 10,000,000+
Kroger : 5,000,000+
Walmart : 10,000,000+
eBay: Buy & Sell this Summer - Discover Deals Now! : 100,000,000+
letgo: Buy & Sell Used Stuff, Cars & Real Estate : 50,000,000+
Amazon Shopping : 100,000,000+
Lazada - Online Shopping & Deals : 50,000,000+
OLX - Buy and Sell : 50,000,000+
The wall : 1,000,000+
Flipp - Weekly Shopping : 10,000,000+
Shrimp skin shopping: spend less, buy better : 5,000,000+
Lotte Home Shopping LOTTE Homeshopping : 5,000,000+
Horn, free country requirements : 1,000,000+
Jiji.ng : 1,000,000+
GS SHOP : 10,000,000+
The birth : 50,000,000+
Home & Shopping - Only in apps. 10% off + 10% off : 10,000,000+
EHS Dongsen Shopping : 1,000,000+
bigbasket - online grocery : 5,000,000+
Bukalapak - Buy and Sell Online : 10,000,000+
Life market : 1,000,000+
Jabong Online Shopping App : 10,000,000+
Family Dollar : 1,000

Some of these apps are based on huge retailers (Amazon, Walmart, IKEA) that have a physical presence or warehouse, which wouldn't be suitable for our purposes, but others serve to link shoppers to online or independent sellers (eBay, ASOS, MarketPlace). 

One possible recommendation would be an app that connects shoppers with sellers or service providers in their local area, aimed at people who are keen to support local independent businesses and who are interested in shopping "green". 