# App Store Analysis

## Using Data Analytics to Inform Business Strategy

## 1. Project Design

### 1.1 Company Background

The company XYZ is in the business of building Android and iOS mobile apps. These apps are free to download and install, and they are distributed through Google Play Store and the Apple App Store.

The company's main source of revenue is in-app advertising. This makes the business model volume-driven: the more users who see and engage with the ads, the greater the revenue opportunity.

### 1.2 Business Challenge

The senior management team is meeting for its annual strategy event to decide on resource allocation and the future app development roadmap. The team is seeking input from the business strategy group to help the company maximize return on investment (ROI).

### 1.3 Project Scope

Our goal for this project is to offer actionable insights that are backed by data. Based on our understanding of the company's business model, the biggest driver of ROI is the number of users of an app, since the revenue opportunity grows with it. We will focus our exploration on this topic.

Our project scope is to analyze app store data and identify the type of apps that are likely to attract more users. Such actionable intelligence can help optimize revenue and the company can focus on creating the kind of apps that are popular.

Our key requirements are as follows:

- We are interested in free apps only
- We are interested in apps in English language only


### 1.4 Sources of data

Apple App Store Data: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps. This dataset contains data about approximately 7,000 iOS apps as of July 2017.

Google Play Store Data: https://www.kaggle.com/lava18/google-play-store-apps. This dataset contains data about approximately 10,000 Android apps as of August 2018.

## 2. Data Preparation

### 2.1 Open Apple App Store and Google Play Store data sets

In [1]:
# Open both datasets and save them as lists of lists.

from csv import reader

# Apple App Store dataset
# The explicit encoding avoids platform-dependent decoding of non-ASCII app names.
with open("./AppleStore.csv", encoding="utf8") as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

# Google Play Store dataset
with open("./googleplaystore.csv", encoding="utf8") as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

### 2.2 Create explore_data() function

In [2]:
# Defining a function to make it easy to print data

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    
    for row in dataset_slice:
        print(row)
        print("\n") #Add a new empty line after each row
        
    if rows_and_columns:
        print("Number of rows: " + str(len(dataset)))
        print("Number of columns: " + str(len(dataset[0])))

### 2.3 Exploring the datasets

Let's look at the structure of the two datasets that we have created. For each dataset, we would like to know the following:

- Number of columns in each dataset to learn about the headers
- Number of rows in each dataset to learn about the total number of entries

Let's use the explore_data() function that we created to gather these insights.

#### 2.3.1 Exploring the iOS dataset

In [3]:
print("ios header:\n", ios_header, "\n")
explore_data(ios, 0, 2, True)

ios header:
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


#### 2.3.2 Exploring the android dataset

In [4]:
print("android header:\n", android_header, "\n")
explore_data(android, 0, 2, True)

android header:
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


#### 2.3.3 Results of the data structure analysis

| Data set | Number of Rows | Number of Columns | Column Names |
| ------   | ------         | ------       | ------ |
| iOS      | 7197           | 16 | 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'     |
| Android  | 10841          | 13 | 'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'|

## 3. Data Cleansing

Our two datasets, in their current format, are a list of lists. However, we cannot use them right away. The data needs to be cleaned and prepared so that we do not get any wrong results in our analysis. As per our requirements, we need to remove all paid apps and non-English language apps too.

We will focus on the following three steps that are integral to any data cleaning process:

- remove or correct wrong data
- remove duplicate data
- modify the data to fit the purpose of our analysis

### 3.1 Finding and deleting erroneous data in Google Play Store dataset

#### 3.1.1 Deleting wrong data

In the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play Store dataset, [this discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes missing data in row 10472. Let's check whether this is the case by comparing the length of every row to the length of the header and printing any row that does not match.

In [5]:
# Checking the index of the entry with missing data

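for index, row in enumerate(android):
    # enumerate() reports the position of this exact row, even if an identical row appears earlier
    if len(row) != len(android_header):
        print("Row with wrong data: ", index)
        print(row)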

Row with wrong data:  10472
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Let us find what's wrong with the data in this row. We will compare this row with another row from the dataset to find the anomaly.

In [6]:
print("android header:\n", android_header, "\n")
print("Row #1 of the data set: \n", android[0], "\n")

android header:
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

Row #1 of the data set: 
 ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 



As we compare rows 1 and 10472, we find that row 10472 has a missing Category, and all other data has moved left by one column (for example, the reviews column for row 10472 shows '3.0M' which is wrong). We need to remove this row as the wrong data will create errors in our analysis.

In [7]:
print("Total number of rows before deletion: ", len(android))
del android[10472] #do not run this more than once
print("Total number of rows after deletion: ", len(android))

Total number of rows before deletion:  10841
Total number of rows after deletion:  10840
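Deleting by a hard-coded index removes a different row each time the cell is rerun, which is why the comment above warns against running it twice. As a minimal sketch of an alternative, the malformed row could be filtered out by length instead. The helper name `drop_malformed_rows` and the toy rows are assumptions for illustration, not part of the original notebook.

```python
# Toy stand-ins for android_header / android; the real lists come from the CSV files.
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],          # malformed: the Category value is missing
    ['App C', 'GAME', '4.5'],
]

def drop_malformed_rows(dataset, header):
    """Return only the rows that have exactly one value per header column."""
    return [row for row in dataset if len(row) == len(header)]

cleaned = drop_malformed_rows(rows, header)
print(len(rows), "->", len(cleaned))

# Running the filter a second time changes nothing, unlike a repeated `del`.
print(drop_malformed_rows(cleaned, header) == cleaned)
```

Because the filter builds a new list, it can be rerun safely, at the cost of keeping two copies of the data in memory while it runs.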


#### 3.1.2 Deleting duplicate entries

Let's check whether the dataset contains duplicate entries. We can do this by building two lists: one containing unique app names and the other containing the names of duplicated apps. We will then examine the duplicates to decide which entry to keep for each app and remove the rest.

In [8]:
android_clean = []    #create an empty list to store unique apps
android_duplicate = []    #create an empty list to store duplicate apps

for row in android:
    if row[0] in android_clean:    #checking by App Name
        android_duplicate.append(row[0])
    else:
        android_clean.append(row[0])
        
print("Number of rows in the cleaned dataset: ", len(android_clean))
print("Number of rows in the duplicate dataset: ", len(android_duplicate))

duplicate_counts = {}

for app in android_duplicate:
    if app in duplicate_counts:
        duplicate_counts[app] += 1
    else:
        duplicate_counts[app] = 1

print("App with the highest number of duplicate entries:", max(duplicate_counts, key=duplicate_counts.get))

Number of rows in the cleaned dataset:  9659
Number of rows in the duplicate dataset:  1181
App with the highest number of duplicate entries: ROBLOX
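The same counts can be obtained in a single pass with `collections.Counter`, which avoids the repeated list membership tests used above. This is only a sketch on toy rows shaped like the Android data (app name at index 0); the variable names are illustrative.

```python
from collections import Counter

# Toy rows in the same shape as the android dataset (app name at index 0).
rows = [
    ['ROBLOX', 'GAME'], ['ROBLOX', 'GAME'], ['ROBLOX', 'FAMILY'],
    ['Slack', 'BUSINESS'], ['Slack', 'BUSINESS'],
    ['Waze', 'MAPS_AND_NAVIGATION'],
]

name_counts = Counter(row[0] for row in rows)

# An app that appears n times contributes n - 1 duplicate rows.
duplicate_counts = {name: n - 1 for name, n in name_counts.items() if n > 1}
n_unique = len(name_counts)
n_duplicates = sum(duplicate_counts.values())

print("Unique apps:", n_unique)
print("Duplicate rows:", n_duplicates)
print("Most duplicated:", max(duplicate_counts, key=duplicate_counts.get))
```

On the toy data this reports 3 unique apps, 3 duplicate rows, and ROBLOX as the most duplicated app.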


Our initial analysis shows that there are 1181 duplicate entries in our dataset. We have also found that the ROBLOX app has the highest number of duplicate entries. Let's print all entries with the ROBLOX app name in the android dataset to identify which entry is most relevant and also find a way to remove duplicates.

In [9]:
for app in android:
    if app[0] == "ROBLOX":
        print(app)

['ROBLOX', 'GAME', '4.5', '4447388', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4447346', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4448791', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4449882', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'GAME', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1 and up']
['ROBLOX', 'FAMILY', '4.5', '4449910', '67M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Adventure;Action & Adventure', 'July 31, 2018', '2.347.225742', '4.1

We notice that the entries differ mainly in the total number of reviews, which is column 4 (index 3). This suggests a rule for choosing among duplicates: the higher the number of reviews, the more recent the data should be, so we keep the entry with the most reviews for each app.

In [10]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print("Expected length of clean list: ", len(android) - len(android_duplicate))
print("Actual length of the new sorted data: ", len(reviews_max))

Expected length of clean list:  9659
Actual length of the new sorted data:  9659


In [11]:
android_clean = []    #stores the new cleaned dataset
already_added = []    #stores app names to avoid duplicates in case the review count is same

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
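The two passes above (find the maximum, then collect matching rows) could also be folded into one. The sketch below assumes review counts are whole numbers stored as strings; `keep_most_reviewed` and its parameters are hypothetical names, and the rows are a shortened toy sample.

```python
def keep_most_reviewed(dataset, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}                      # app name -> row with the most reviews so far
    for row in dataset:
        name = row[name_idx]
        n_reviews = int(row[reviews_idx])
        # Strict '>' keeps the earlier row when review counts tie.
        if name not in best or n_reviews > int(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

rows = [
    ['ROBLOX', 'GAME', '4.5', '4447388'],
    ['ROBLOX', 'GAME', '4.5', '4449910'],
    ['ROBLOX', 'FAMILY', '4.5', '4449910'],
    ['Sketch', 'ART_AND_DESIGN', '4.5', '215644'],
]
deduped = keep_most_reviewed(rows)
for row in deduped:
    print(row)
```

One difference from the cells above: the result is ordered by the first appearance of each app name rather than by the position of the row that was kept.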

In [12]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


### 3.2 Deleting wrong data in Apple Store dataset

[This discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/106176) on the App Store data mentions the presence of duplicate data. Let's look for duplicates in our ios dataset.

#### 3.2.1 Deleting wrong data and duplicates

In [13]:
print(ios_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [14]:
# Create two empty lists - one list contains unique data and the other list contains duplicates.

ios_clean = []
ios_duplicate_apps = []

print("Number of rows in the original dataset: ", len(ios))
for app in ios:
    if app[0] in ios_clean:    #checking by id column
        ios_duplicate_apps.append(app[0])
    else:
        ios_clean.append(app[0])
        
print("Number of rows in the cleaned dataset: ", len(ios_clean))
print("Number of rows in the duplicate dataset: ", len(ios_duplicate_apps))

Number of rows in the original dataset:  7197
Number of rows in the cleaned dataset:  7197
Number of rows in the duplicate dataset:  0


The App Store dataset does not seem to have any app with duplicate entries.

### 3.3 Removing non-English Apps

As mentioned in Section 1.3, one of the key requirements of our project is to focus on English-language apps only. As we analyze our datasets, we find that there are many apps that are not designed for English-speaking audiences. We will first identify these apps and then remove them from the datasets.

#### 3.3.1 Identification Strategy

We will analyze the app names to check whether they contain non-English characters. The characters commonly used in English text fall within the ASCII range of 0 to 127. We can check whether each character in an app name meets this criterion to separate English from non-English apps.

In [15]:
string = 'abc'
newstring = [each for each in string]
print(newstring)

['a', 'b', 'c']


In [16]:
def is_english(app_name):
    char_list = [x for x in app_name]
    non_eng_count = 0    #allowing 3 non-English characters to provide for emojis and special characters
    for each in char_list:
        if ord(each) > 127:
            non_eng_count += 1
    if non_eng_count > 3:
        return False
    else:
        return True

In [17]:
is_english('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
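To see how the three-character allowance behaves, here is a small self-contained sketch that applies the same rule to a few sample names. The compact rewrite of `is_english` and the `max_non_ascii` parameter are illustrative assumptions, not the notebook's own definition.

```python
def is_english(app_name, max_non_ascii=3):
    """Same rule as above: tolerate up to `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for ch in app_name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

samples = [
    'Instagram',
    'Docs To Go™ Free Office Suite',      # one non-ASCII character (™)
    'Instachat 😜',                        # one non-ASCII character (emoji)
    '爱奇艺PPS -《欢乐颂2》电视剧热播',      # many non-ASCII characters
    '中文',                                # only two non-ASCII characters
]
for name in samples:
    print(is_english(name), name)
```

The last sample shows a limitation of the heuristic: a short non-English name with three or fewer non-ASCII characters still passes the check, so a few non-English apps may remain after filtering.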

#### 3.3.2 Removing non-English apps

In [18]:
ios_English_only = []
android_English_only = []

for app in ios:
    name = app[1]
    if is_english(name):
        ios_English_only.append(app)
        
for app in android_clean:
    name = app[0]
    if is_english(name):
        android_English_only.append(app)

In [19]:
explore_data(ios_English_only, 0, 2, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 6183
Number of columns: 16


In [20]:
explore_data(android_English_only, 0, 2, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


#### 3.3.3 Isolating the free apps

In [21]:
print("ios_header: ", ios_header)
print("\n")
print("android header: ", android_header)

ios_header:  ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


android header:  ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [22]:
ios_eng_free = []
android_eng_free = []

for app in ios_English_only:
    if app[4] == '0.0':
        ios_eng_free.append(app)
        
for app in android_English_only:
    if app[7] == '0':
        android_eng_free.append(app)
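The string comparisons above depend on the exact text used for a zero price ('0.0' in one dataset, '0' in the other). A more tolerant check could parse the price as a number first. This sketch assumes prices are plain decimals, optionally prefixed with '$'; `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Treat a price as free when it parses to zero after dropping a leading '$'."""
    return float(price_text.strip().lstrip('$')) == 0.0

for text in ['0', '0.0', '0.00', '$0', '$4.99', '2.99']:
    print(repr(text), is_free(text))
```

With such a helper, the same condition could be used for both datasets instead of one string literal per store.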

In [23]:
explore_data(ios_eng_free, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16


In [24]:
explore_data(android_eng_free, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


Our datasets are now ready: we have 3222 apps in the iOS dataset and 8864 apps in the Android dataset.

## 4. Data Analysis

Our validation strategy for an app idea has three steps:

- Build a minimal Android version of the app, and add it to Google Play
- If the app has a good response from users, we develop it further
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store

We need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's build frequency tables to determine the most common genres for each market.

In [25]:
print("ios_header: ", ios_header)
print("\n")
print("android header: ", android_header)

ios_header:  ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


android header:  ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


We will need to build frequency tables for the prime_genre column of the App Store dataset, and for the Genres and Category columns of the Google Play Store dataset.

Let's build two functions to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function we can use to display the percentages in a descending order

In [26]:
def freq_table(dataset, index):
    dataset_freq_table = {}
    count_category = 0
    
    for app in dataset:
        category = app[index]
        count_category += 1
        if category in dataset_freq_table:
            dataset_freq_table[category] += 1
        else:
            dataset_freq_table[category] = 1
            
    freq_table_percentage = {}
    for key in dataset_freq_table:
        percentage = round((dataset_freq_table[key] / count_category) * 100, 4)
        freq_table_percentage[key] = percentage
        
    return freq_table_percentage

In [27]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
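For comparison, the two functions above could be combined using `collections.Counter`, whose `most_common()` method already returns values sorted by count. This is a sketch rather than a drop-in replacement; `freq_table_pct` and `ndigits` are illustrative names.

```python
from collections import Counter

def freq_table_pct(dataset, index, ndigits=4):
    """Percentage share of each distinct value in column `index`, largest first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, round(n / total * 100, ndigits))
            for value, n in counts.most_common()]

rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_pct(rows, 1)
for genre, pct in table:
    print(genre, ':', pct)
```

Values with equal counts appear in first-seen order here, whereas `display_table()` breaks ties by sorting on the value itself, so the two outputs can list tied entries in a different order.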

In [28]:
# A faster way to display the most common genre without defining the display_table() function

# ios_freq = freq_table(ios_eng_free, 11)
# print(max(ios_freq, key=ios_freq.get))

# android_freq = freq_table(android_eng_free, 1)
# print(max(android_freq, key=android_freq.get))

# android_freq = freq_table(android_eng_free, 9)
# print(max(android_freq, key=android_freq.get))

In [29]:
display_table(ios_eng_free, 11)    #frequency table for prime_genre column

Games : 58.1626
Entertainment : 7.8833
Photo & Video : 4.9659
Education : 3.6623
Social Networking : 3.2899
Shopping : 2.6071
Utilities : 2.514
Sports : 2.1415
Music : 2.0484
Health & Fitness : 2.0174
Productivity : 1.7381
Lifestyle : 1.5829
News : 1.3346
Travel : 1.2415
Finance : 1.1173
Weather : 0.869
Food & Drink : 0.807
Reference : 0.5587
Business : 0.5276
Book : 0.4345
Navigation : 0.1862
Medical : 0.1862
Catalogs : 0.1241


We find Games to be the most common genre followed by Entertainment. We also notice that most apps are designed for fun (games, entertainment, photo & video, social networking, sports, music), while a relatively smaller number of apps are designed for practical purposes (education, shopping, utilities, productivity, lifestyle).

Based on this analysis, we would recommend an app profile that falls in the fun category. We can consider Games, Entertainment and Photo & Video as genres of interest. However, as of now, we do not know if these genres have the highest number of users.

In [30]:
display_table(android_eng_free, 1)    #frequency table for Category column

FAMILY : 18.9079
GAME : 9.7247
TOOLS : 8.4612
BUSINESS : 4.5916
LIFESTYLE : 3.9034
PRODUCTIVITY : 3.8921
FINANCE : 3.7004
MEDICAL : 3.5311
SPORTS : 3.3958
PERSONALIZATION : 3.3168
COMMUNICATION : 3.2378
HEALTH_AND_FITNESS : 3.0799
PHOTOGRAPHY : 2.9445
NEWS_AND_MAGAZINES : 2.7978
SOCIAL : 2.6625
TRAVEL_AND_LOCAL : 2.3353
SHOPPING : 2.245
BOOKS_AND_REFERENCE : 2.1435
DATING : 1.8615
VIDEO_PLAYERS : 1.7938
MAPS_AND_NAVIGATION : 1.3989
FOOD_AND_DRINK : 1.241
EDUCATION : 1.162
ENTERTAINMENT : 0.9589
LIBRARIES_AND_DEMO : 0.9364
AUTO_AND_VEHICLES : 0.9251
HOUSE_AND_HOME : 0.8236
WEATHER : 0.801
EVENTS : 0.7107
PARENTING : 0.6543
ART_AND_DESIGN : 0.6431
COMICS : 0.6205
BEAUTY : 0.5979


In [31]:
display_table(android_eng_free, 9)    #frequency table for Genres column

Tools : 8.4499
Entertainment : 6.0695
Education : 5.3475
Business : 4.5916
Productivity : 3.8921
Lifestyle : 3.8921
Finance : 3.7004
Medical : 3.5311
Sports : 3.4634
Personalization : 3.3168
Communication : 3.2378
Action : 3.1024
Health & Fitness : 3.0799
Photography : 2.9445
News & Magazines : 2.7978
Social : 2.6625
Travel & Local : 2.324
Shopping : 2.245
Books & Reference : 2.1435
Simulation : 2.042
Dating : 1.8615
Arcade : 1.8502
Video Players & Editors : 1.7712
Casual : 1.7599
Maps & Navigation : 1.3989
Food & Drink : 1.241
Puzzle : 1.1282
Racing : 0.9928
Role Playing : 0.9364
Libraries & Demo : 0.9364
Auto & Vehicles : 0.9251
Strategy : 0.9138
House & Home : 0.8236
Weather : 0.801
Events : 0.7107
Adventure : 0.6769
Comics : 0.6092
Beauty : 0.5979
Art & Design : 0.5979
Parenting : 0.4964
Card : 0.4513
Casino : 0.4287
Trivia : 0.4174
Educational;Education : 0.3949
Board : 0.3836
Educational : 0.3723
Education;Education : 0.3384
Word : 0.2595
Casual;Pretend Play : 0.2369
Music : 0.20

As we review the genres and categories, we notice that apps designed for practical purposes (Tools, Business, Lifestyle, Productivity, Finance) are roughly as numerous as entertainment apps (Games, Social, Shopping).

Let's find out the average number of installs for each genre so that we can identify which genres are most popular. This information is readily available in the Google Play Store dataset. The App Store dataset has no install counts, so we will use the total number of user ratings as a proxy.

Let's start with calculating the average number of user ratings per app genre on the App store. The steps we need to undertake:

- Isolate the apps of each genre
- Add up the user ratings for the apps of that genre
- Divide the sum by the number of apps belonging to that genre

In [32]:
ios_prime_genres = freq_table(ios_eng_free, 11)

In [33]:
for genre in ios_prime_genres:
    total = 0
    len_genre = 0
    for app in ios_eng_free:
        genre_app = app[11]
        if genre_app == genre:
            total += float(app[5])
            len_genre += 1
    average_rating = total / len_genre
    ios_prime_genres[genre] = average_rating
    
dict(sorted(ios_prime_genres.items(), key=lambda item: item[1]))

#     print(genre, ':', average_rating)
# print("\nGenre with highest average user ratings: ", max(ios_prime_genres, key=ios_prime_genres.get))

{'Medical': 612.0,
 'Catalogs': 4004.0,
 'Education': 7003.983050847458,
 'Business': 7491.117647058823,
 'Entertainment': 14029.830708661417,
 'Lifestyle': 16485.764705882353,
 'Utilities': 18684.456790123455,
 'Productivity': 21028.410714285714,
 'News': 21248.023255813954,
 'Games': 22788.6696905016,
 'Sports': 23008.898550724636,
 'Health & Fitness': 23298.015384615384,
 'Shopping': 26919.690476190477,
 'Travel': 28243.8,
 'Photo & Video': 28441.54375,
 'Finance': 31467.944444444445,
 'Food & Drink': 33333.92307692308,
 'Book': 39758.5,
 'Weather': 52279.892857142855,
 'Music': 57326.530303030304,
 'Social Networking': 71548.34905660378,
 'Reference': 74942.11111111111,
 'Navigation': 86090.33333333333}
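The nested loop above rescans the whole dataset once per genre. As a sketch of an alternative, the three steps (isolate, sum, divide) can be done in one pass by accumulating totals and counts per group. The function name and toy rows are assumptions for illustration.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Mean of the numeric column `value_idx` for each value of column `group_idx`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}

rows = [
    ['1', 'Waze', 'Navigation', '300'],
    ['2', 'Maps', 'Navigation', '100'],
    ['3', 'Notes', 'Productivity', '50'],
]
averages = average_by_group(rows, group_idx=2, value_idx=3)
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

For the iOS data, the equivalent call would use index 11 (prime_genre) as the group and index 5 (rating_count_tot) as the value.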

From these averages, we observe that apps belonging to the Navigation, Reference, Social Networking and Music genres have the highest number of user ratings on average.

Let's do a deep dive into these popular genres to explore what kind of apps are included in them.

In [34]:
navigation = []
for app in ios_eng_free:
    if app[11] == 'Navigation':
        navigation.append(app)
        
print(navigation)

[['323229106', 'Waze - GPS Navigation, Maps & Real-time Traffic', '94139392', 'USD', '0.0', '345046', '3040', '4.5', '4.5', '4.24', '4+', 'Navigation', '37', '5', '36', '1'], ['585027354', 'Google Maps - Navigation & Transit', '120232960', 'USD', '0.0', '154911', '1253', '4.5', '4.0', '4.31.1', '12+', 'Navigation', '37', '5', '34', '1'], ['329541503', 'Geocaching®', '108166144', 'USD', '0.0', '12811', '134', '3.5', '1.5', '5.3', '4+', 'Navigation', '37', '0', '22', '1'], ['504677517', 'CoPilot GPS – Car Navigation & Offline Maps', '82534400', 'USD', '0.0', '3582', '70', '4.0', '3.5', '10.0.0.984', '4+', 'Navigation', '38', '5', '25', '1'], ['344176018', 'ImmobilienScout24: Real Estate Search in Germany', '126867456', 'USD', '0.0', '187', '0', '3.5', '0.0', '9.5', '4+', 'Navigation', '37', '5', '3', '1'], ['463431091', 'Railway Route Search', '46950400', 'USD', '0.0', '5', '0', '3.0', '0.0', '3.17.1', '4+', 'Navigation', '37', '0', '1', '1']]
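The raw list is hard to read, so a short sketch can summarize it. The names and rating counts below are copied from the Navigation rows printed above; the variable names are illustrative.

```python
# (track_name, rating_count_tot) pairs copied from the Navigation rows printed above.
navigation_ratings = [
    ('Waze - GPS Navigation, Maps & Real-time Traffic', 345046),
    ('Google Maps - Navigation & Transit', 154911),
    ('Geocaching®', 12811),
    ('CoPilot GPS – Car Navigation & Offline Maps', 3582),
    ('ImmobilienScout24: Real Estate Search in Germany', 187),
    ('Railway Route Search', 5),
]

ranked = sorted(navigation_ratings, key=lambda pair: pair[1], reverse=True)
total = sum(count for _, count in ranked)
top_two = sum(count for _, count in ranked[:2])
top_two_share = top_two / total * 100

print("Average ratings per app:", round(total / len(ranked), 2))
print("Share of ratings held by the top two apps: {:.1f}%".format(top_two_share))
```

Waze and Google Maps account for the large majority of the ratings in this genre, so the high Navigation average may reflect two very popular apps rather than the genre as a whole.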


### Finding most popular apps on Google Play Store

In [35]:
display_table(android_eng_free, 5)

1,000,000+ : 15.7265
100,000+ : 11.5523
10,000,000+ : 10.5483
10,000+ : 10.1986
1,000+ : 8.3935
100+ : 6.9156
5,000,000+ : 6.8254
500,000+ : 5.5618
50,000+ : 4.7721
5,000+ : 4.5126
10+ : 3.5424
500+ : 3.2491
50,000,000+ : 2.3014
100,000,000+ : 2.1322
50+ : 1.9179
5+ : 0.7897
1+ : 0.5077
500,000,000+ : 0.2708
1,000,000,000+ : 0.2256
0+ : 0.0451
0 : 0.0113
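The install values above are open-ended buckets stored as strings (for example '1,000,000+'), so they have to be converted to numbers before they can be averaged. A minimal sketch of that conversion, with `parse_installs` as a hypothetical helper name:

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to its lower bound as an int."""
    return int(text.replace(',', '').replace('+', ''))

for bucket in ['0', '0+', '10,000+', '1,000,000,000+']:
    print(bucket, '->', parse_installs(bucket))
```

Each bucket is reduced to its lower bound, so any average computed from these values is likely to understate the true number of installs.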


In [36]:
android_category = freq_table(android_eng_free, 1)

In [37]:
for category in android_category:
    total = 0
    len_category = 0
    
    for app in android_eng_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    average_category = total / len_category
    android_category[category] = average_category
    print(category, ':', average_category)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_