## Profitable Application Analysis for AppleStore and Google PlayStore

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. 

The goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

## Opening Data Sets
To avoid spending resources on collecting new data ourselves, we should first see if we can find any relevant existing data at no cost. Luckily, there are two data sets that seem suitable for our goals:

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv) 

We'll start by opening these two data sets. 

In [1]:
from csv import reader


## Opening the Google PlayStore Apps
opened_playstore = open('googleplaystore.csv', encoding='utf8')
read_playstore = reader(opened_playstore)
playstore_list = list(read_playstore)
playstore_header = playstore_list[0]
playstore_content = playstore_list[1:]

## Opening the Apple AppStore Apps
opened_appstore = open('AppleStore.csv', encoding='utf8')
read_appstore = reader(opened_appstore)
appstore_list = list(read_appstore)
appstore_header = appstore_list[0]
appstore_content = appstore_list[1:]
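Both files are opened in the same way, so the steps can be wrapped in a small helper. The sketch below is one possible version; the name `load_csv` is hypothetical and is not part of the original notebook. It uses a `with` block so that the file is closed once it has been read.

```python
from csv import reader


def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`.

    `load_csv` is a hypothetical helper name used only for this sketch.
    """
    # newline='' is the setting the csv module documentation recommends
    with open(path, encoding=encoding, newline='') as opened_file:
        rows = list(reader(opened_file))
    return rows[0], rows[1:]
```

With this helper, a call such as `load_csv('googleplaystore.csv')` should return the header row and the remaining rows as two separate lists.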

## Exploring the Data Sets
To make the data sets easier to explore, we create a function named `explore_data()` that we can use repeatedly to print rows in a readable way.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

## We print contents from the Apple AppStore as shown below:

In [3]:
print(appstore_header)
print('\n')
print(explore_data(appstore_list, 1, 3, True))

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16
None


-----------------------------------------------------------------
## Explanation Apple AppStore

The relevant columns here will be `track_name`, `price`, `rating_count_tot`, `user_rating`, `cont_rating`, and `prime_genre`. Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).

------------------------------------------------------

## We print contents from the Google PlayStore as shown below:

In [4]:
print(playstore_header)
print('\n')
print(explore_data(playstore_list, 1, 3, True))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13
None


-----------------------------------------------------------------
## Explanation Google PlayStore

The relevant columns here will be `App`, `Category`, `Rating`, `Installs`, `Type`, `Price`, and `Genres`. Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://www.kaggle.com/lava18/google-play-store-apps).

----------------------------------------------------------------

## Data Cleaning

Data cleaning is done before the analysis; it includes removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis.

[One of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) for the Google Play data set describes an error in a certain row.

Because the user who reported the error may or may not have counted the header row, we slice and print a few rows around the reported index below. This lets us identify the faulty row and confirm that the error is real.

In [5]:
print(explore_data(playstore_list, 10470, 10474, True))

['TownWiFi | Wi-Fi Everywhere', 'COMMUNICATION', '3.9', '2372', '58M', '500,000+', 'Free', '0', 'Everyone', 'Communication', 'August 2, 2018', '4.2.1', '4.2 and up']


['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10842
Number of columns: 13
None


## Deleting Data Set with Errors

From the output we can see that the last row (index 10473) has the error: the `Category` value is missing, so the remaining values are shifted one column to the left. The reporter made a mistake in identifying the right index.

### The row with the error

```
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
```

To delete the row we use the `del` statement, followed by the name of the list of lists and, in square brackets, the index of the row to be deleted.

> NOTE: We must make sure we don't run the `del` statement more than once, otherwise more than one row will be deleted.
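One way to make the deletion safe to re-run is to locate faulty rows by a property of the row instead of a fixed index. The sketch below is a minimal illustration on toy data, assuming that a faulty row is one whose number of fields differs from the header's (the row with the error above has 12 values against 13 columns). The helper name `find_malformed_rows` is hypothetical.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(header)
    return [index for index, row in enumerate(dataset) if len(row) != expected]


# Toy data standing in for the real data set (values are illustrative).
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # 'Category' is missing
    ['App C', 'SOCIAL', '4.5'],
]

bad_indexes = find_malformed_rows(rows, header)
print(bad_indexes)  # [1]

# Delete from the end so that the earlier indexes stay valid.
for index in sorted(bad_indexes, reverse=True):
    del rows[index]

print(len(rows))  # 2
```

Running the deletion loop a second time removes nothing, because no malformed rows remain.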

In [6]:
del playstore_list[10473]

> Here we check that the row was really deleted

In [7]:
print(explore_data(playstore_list, 10473, 10474, True))

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


Number of rows: 10841
Number of columns: 13
None


In [8]:
print(playstore_list[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


## Identifying Duplicate Data Sets (Sample)

The Google PlayStore data set seems to contain duplicate entries. We will print a few rows to confirm this hypothesis, and if it is true, we will count the number of duplicates in the data set.

In [9]:
for app in playstore_content:
    name = app[0]
    if name == 'Instagram':
        print(app)
        print('\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




## Identifying Duplicate Data Sets (Google PlayStore)
From the output above we can see that `Instagram` appears more than once. We will now separate the duplicated app names from the unique ones and count them. At this stage we only count the duplicates; they are not removed yet.

In [10]:
duplicate_datalist = []
unique_datalist = []
for app in playstore_content:
    name = app[0]
    if name in unique_datalist:
        duplicate_datalist.append(name)
    else:
        unique_datalist.append(name)
        
print('Number of duplicated data set is: ', len(duplicate_datalist))
print('\n')
print('Number of unique data set is: ', len(unique_datalist))
print('\n')
print('The examples of duplicate data sets are: ', duplicate_datalist[:10])

Number of duplicated data set is:  1181


Number of unique data set is:  9660


The examples of duplicate data sets are:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
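The `name in unique_datalist` test scans a list, which becomes slow as the list grows. As an alternative sketch (not the approach used in this notebook), the same counts can be obtained with `collections.Counter` from the standard library. The toy rows below stand in for `playstore_content`.

```python
from collections import Counter

# Toy rows standing in for `playstore_content`; only the name column matters.
rows = [['Box'], ['Slack'], ['Box'], ['Instagram'], ['Box'], ['Slack']]

# Count how many times each app name appears.
name_counts = Counter(row[0] for row in rows)

n_unique = len(name_counts)
# Every appearance after the first one counts as a duplicate.
n_duplicates = sum(count - 1 for count in name_counts.values())
duplicated_names = [name for name, count in name_counts.items() if count > 1]

print(n_unique)          # 3
print(n_duplicates)      # 3
print(duplicated_names)  # ['Box', 'Slack']
```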


## Removing Duplicates

To remove the duplicates, we will:

* Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
* Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

To achieve this, we'll use the `not in` operator, which is the opposite of the `in` operator. Since we will be working with `playstore_content`, we first need to delete the row with the error, as we did for `playstore_list`. Because `playstore_content` does not include the header row, the faulty row is at index 10472 rather than 10473.

In [11]:
del playstore_content[10472]

In [12]:
reviews_max = {}
for app in playstore_content:
    name = app[0]
    n_reviews = float(app[3])
    if (name in reviews_max) and (reviews_max[name]<n_reviews):
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))



9659


## Removing Duplicate Rows: Part Two

To remove the duplicates, we use the dictionary created above:

* We start by creating two empty lists: `playstore_clean` (which will store our new cleaned data set) and `already_added` (which will just store app names).
* We loop through the Google Play data set (without the header row), and for each iteration:
    * Assign the app name to a variable named `name`.
    * Convert the number of reviews to float, and assign it to a variable named `n_reviews`.
    * If `n_reviews` is the same as the maximum number of reviews mapped to the app name (the number can be found in the `reviews_max` dictionary) **and** `name` is not already in the list `already_added`:
        * Append the entire row to the `playstore_clean` list (which will eventually be a list of lists and store our cleaned data set).
        * Append the name of the app `name` to the `already_added` list, which helps us keep track of apps that we already added.

The second condition is needed because some duplicate rows share the same highest number of reviews; without it, such an app would be added more than once.

Then we check the length of `playstore_clean` to ensure everything went as expected. The data set should have 9,659 rows.

In [13]:
playstore_clean = []
already_added = []
for app in playstore_content:
    name = app[0]
    n_reviews = float(app[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        playstore_clean.append(app)
        already_added.append(name)
print(len(playstore_clean))
print(len(already_added))

9659
9659
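The two steps above (building `reviews_max`, then filtering) can also be combined into a single pass that stores the best row per app directly. This is a sketch of an alternative, not the method used in this notebook; the helper name `keep_most_reviewed` is hypothetical, and the 'Toy App' rows are made-up values.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())


# Toy rows: (name, category, rating, reviews).
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Toy App', 'TOOLS', '4.0', '100'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Toy App', 'TOOLS', '4.0', '100'],
]

cleaned = keep_most_reviewed(toy_rows)
print(len(cleaned))   # 2
print(cleaned[0][3])  # 66577446
```

The apps are returned in the order in which each name first appears, so the row order can differ from `playstore_clean` even though the same rows are kept.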


## Removing Non-English Apps

Remember we use English for the apps we develop at our company, and we'd like to analyze only the apps that are directed toward an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience.

One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Each character we use in a string has a corresponding number associated with it. The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. 

Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. 
- If the number is equal to or less than 127, then the character belongs to the set of common English characters.
- If an app name contains a character whose number is greater than 127, then the app probably has a non-English name.

Our app names, however, are stored as strings, so how could we take each individual character of a string and check its corresponding number?
> In Python, strings are indexable and iterable, which means we can use indexing to select an individual character, and we can also iterate on the string using a for loop.

In [14]:
def string_detector(string_entered):
    for character in string_entered:
        if ord(character) > 127:
            return False
        
    return True
        

instagram = string_detector('Instagram')
chinese = string_detector('爱奇艺PPS -《欢乐颂2》电视剧热播')
docs = string_detector('Docs To Go™ Free Office Suite')
instachat = string_detector('Instachat 😜')
print(instagram, chinese, docs, instachat)

True False False False


We saw that the function couldn't correctly identify certain English app names like `'Docs To Go™ Free Office Suite'` and `'Instachat 😜'`. This is because emojis and characters like `™` fall outside the ASCII range and have corresponding numbers over 127.



In [15]:
print(ord('™'))
print(ord('😜'))

8482
128540


-----------------------------------------------------------------
If we use the function as it is, we'll lose useful data, since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. We will edit the function we created above, and then use it to filter out the non-English apps.

In [16]:
def string_detector(string_entered):
    checker = 0
    for character in string_entered:
        if ord(character) > 127:
            checker += 1
    if checker > 3:
        return False
    else:
        return True
        
chinese = string_detector('爱奇艺PPS -《欢乐颂2》电视剧热播')        
docs = string_detector('Docs To Go™ Free Office Suite')
instachat = string_detector('Instachat 😜')
print(chinese, docs, instachat)
            

False True True
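The limit of three characters is a judgment call, so it can be convenient to expose it as a parameter. The sketch below shows the same check with a configurable threshold; the function name `is_probably_english` and the parameter `max_non_ascii` are hypothetical and are not used elsewhere in this notebook.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters above code 127."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Instachat 😜'))                    # True
print(is_probably_english('Docs To Go™ Free Office Suite'))   # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
print(is_probably_english('Instachat 😜', max_non_ascii=0))   # False
```

Setting `max_non_ascii=0` reproduces the strict behavior of the first version of the function.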


We now use the new function to filter out non-English apps from both data sets. Loop through each data set. If an app name is identified as English, append the whole row to a separate list.

In [17]:
playstore_english = []
appstore_english = []

for app in playstore_clean:
    name = app[0]
    if string_detector(name):
        playstore_english.append(app)
for app in appstore_content:
    name = app[1]
    if string_detector(name):
        appstore_english.append(app)
        
print(len(playstore_english))
print(len(appstore_english))

explore_data(appstore_english, 0, 3, True)
        

9614
6183
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


In [18]:
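def dataset_english(dataset, name_index):
    # name_index is the column holding the app name:
    # 0 for the Google Play data, 1 for the App Store data
    english_list = []
    for app in dataset:
        name = app[name_index]
        if string_detector(name):
            english_list.append(app)
    return english_list


print(len(dataset_english(playstore_clean, 0)))
print(len(dataset_english(appstore_content, 1)))


9614
6183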


In [19]:
explore_data(appstore_english, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


## Isolating Free Apps

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, so we'll need to isolate only the free apps for our analysis.

In [20]:
playstore_freeapps = []
appstore_freeapps = []
for app in playstore_english:
    price_playstore = app[6]
    if (price_playstore == 'Free'):
        playstore_freeapps.append(app)
for app in appstore_english:
    price_appstore = float(app[4])
    if (price_appstore == 0):
        appstore_freeapps.append(app)

print(len(playstore_freeapps))
print(len(appstore_freeapps))

            

8863
3222


Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea comprises three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets.

In [21]:
print(appstore_freeapps[0:3])
print('\n')
print(appstore_header)
print('\n')
print(playstore_freeapps[2:5])
print('\n')
print(playstore_header)

[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']]


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


[['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1

## Sorting Out Most Common Apps by Genre

We inspected the data sets to identify the columns that might be useful for finding out what the most common genres in each market are. Our conclusion was that we'll need to build a frequency table for the `prime_genre` column of the App Store data set, and for the `Genres` and `Category` columns of the Google Play data set.

We'll build two functions we can use to analyze the frequency tables:

* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in descending order

> Note: dictionaries are not sorted by their values, so a raw frequency table is difficult to analyze. The second function displays the entries of the frequency table in descending order.

To do that, we'll make use of the built-in `sorted()` function. This function takes an iterable (such as a list, dictionary, or tuple) and returns a list of its elements sorted in ascending or descending order (the `reverse` parameter controls the direction). Sorting a dictionary directly sorts only its keys, so we first convert the frequency table into a list of `(value, key)` tuples.
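The behavior of `sorted()` described above can be illustrated on a small toy dictionary. The percentages below are illustrative and are not taken from the data sets.

```python
# Toy frequency table (illustrative values only).
table = {'Games': 50.0, 'Music': 10.0, 'Weather': 40.0}

# Sorting the dictionary itself sorts its keys, not its values.
print(sorted(table))
# ['Games', 'Music', 'Weather']

# Sorting (value, key) tuples orders the entries by value.
as_tuples = [(value, key) for key, value in table.items()]
print(sorted(as_tuples, reverse=True))
# [(50.0, 'Games'), (40.0, 'Weather'), (10.0, 'Music')]

# The same ordering with a key function, keeping (key, value) pairs.
by_value = sorted(table.items(), key=lambda item: item[1], reverse=True)
print(by_value)
# [('Games', 50.0), ('Weather', 40.0), ('Music', 10.0)]
```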

In [22]:
def freq_table(dataset, index):
    # Count how many times each value appears in the given column
    frequency_table = {}
    for apps in dataset:
        core_value = apps[index]
        if core_value in frequency_table:
            frequency_table[core_value] += 1
        else:
            frequency_table[core_value] = 1

    # Convert the counts to percentages of the total number of rows
    freqtable_percentage = {}
    for key in frequency_table:
        percentage = (frequency_table[key]/len(dataset))*100
        freqtable_percentage[key] = percentage

    return freqtable_percentage


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
   
print(display_table(appstore_freeapps, 11))
print('\n')
print(display_table(playstore_freeapps, 1))
print('\n')
print(display_table(playstore_freeapps, 9))

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665
None


FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.31716123208

We'll now focus on analyzing these frequency tables. For the App Store, Games (about 58%) and Entertainment (about 8%) are the most common genres among the free English apps. The Play Store picture is less clear-cut. Looking at the `Category` column, `FAMILY` (about 19%) leads, followed by `GAME` (about 10%) and `TOOLS` (about 8%).

## Most Popular Apps by Genre on AppStore

The frequency tables we analyzed above showed us that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps.

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

We start by calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:

- Isolate the apps of each genre.
- Sum up the user ratings for the apps of that genre.
- Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

In [23]:
genre_appstore = freq_table(appstore_freeapps, 11)

for genre in genre_appstore:
    total = 0
    len_genre = 0
    for app in appstore_freeapps:
        genre_app = app[11]
        if genre_app == genre:
            num_user_rating = float(app[5])
            total += num_user_rating
            len_genre += 1
    avg_user_rating = total/len_genre
    print(genre, ':', avg_user_rating)
    

Shopping : 26919.690476190477
Travel : 28243.8
Food & Drink : 33333.92307692308
Games : 22788.6696905016
Medical : 612.0
Weather : 52279.892857142855
Utilities : 18684.456790123455
Sports : 23008.898550724636
Education : 7003.983050847458
Book : 39758.5
Business : 7491.117647058823
Catalogs : 4004.0
Entertainment : 14029.830708661417
Social Networking : 71548.34905660378
Navigation : 86090.33333333333
Music : 57326.530303030304
Productivity : 21028.410714285714
Photo & Video : 28441.54375
Reference : 74942.11111111111
Finance : 31467.944444444445
News : 21248.023255813954
Health & Fitness : 23298.015384615384
Lifestyle : 16485.764705882353
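The nested loop above scans the whole data set once per genre. As a sketch of an alternative, the same averages can be computed in a single pass by keeping a running total and a running count for each genre. The helper name `average_by_group` is hypothetical, and the toy rows reuse a few rating counts from the App Store data with a simplified column layout.

```python
def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column}, computed in one pass."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows: (name, genre, rating count).
toy_rows = [
    ['Waze', 'Navigation', '345046'],
    ['Google Maps', 'Navigation', '154911'],
    ['Bible', 'Reference', '985920'],
]

averages = average_by_group(toy_rows, group_index=1, value_index=2)
print(averages)
# {'Navigation': 249978.5, 'Reference': 985920.0}
```

For the real App Store rows, the genre is at index 11 and the rating count at index 5, so the call would be `average_by_group(appstore_freeapps, 11, 5)`.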


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [24]:
for app in appstore_freeapps:
    if app[11] == 'Navigation':
        print(app[1], ':', app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Reference apps have 74,942 user ratings on average, but it's actually the Bible and Dictionary.com that skew the average upward:

In [25]:
for app in appstore_freeapps:
    if app[11] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


Other genres that seem popular include weather, book, food and drink, and finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

- Weather apps: people generally don't spend too much time in-app, and the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.
- Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.
- Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

Now let's analyze the Google Play market a bit.

## Most Popular Apps by Genre on Google PlayStore

For the Google PlayStore, we actually have data about the number of installs, so we should be able to get a clearer picture of genre popularity. However, the install numbers don't seem precise enough: most values are open-ended strings that combine digits with commas and a plus sign (100+, 1,000+, 5,000+, etc.):

In [26]:
display_table(playstore_freeapps, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


To perform computations, however, we'll need to convert each `Installs` value to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below using the `str.replace()` method, where we also compute the average number of installs for each genre (category).
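The conversion on its own can be sketched as a small function. The name `parse_installs` is hypothetical. The result is a lower bound, since a value such as `1,000,000+` only says that the app has at least that many installs.

```python
def parse_installs(text):
    """Convert a string such as '1,000,000+' to the float 1000000.0."""
    return float(text.replace(',', '').replace('+', ''))


print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('0+'))          # 0.0
```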

In [31]:
genre_playstore = freq_table(playstore_freeapps, 1)
for category in genre_playstore:
    total = 0
    len_category = 0
    for app in playstore_freeapps:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            total += float(n_installs)
            len_category += 1
    avg_installs = total/len_category
    print(category, ':', avg_installs)

LIFESTYLE : 1437816.2687861272
PERSONALIZATION : 5201482.6122448975
ART_AND_DESIGN : 1986335.0877192982
EVENTS : 253542.22222222222
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
ENTERTAINMENT : 11640705.88235294
BEAUTY : 513151.88679245283
PHOTOGRAPHY : 17840110.40229885
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
DATING : 854028.8303030303
NEWS_AND_MAGAZINES : 9549178.467741935
PARENTING : 542603.6206896552
EDUCATION : 1833495.145631068
COMMUNICATION : 38456119.167247385
TRAVEL_AND_LOCAL : 13984077.710144928
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
HOUSE_AND_HOME : 1331540.5616438356
MAPS_AND_NAVIGATION : 4056941.7741935486
SOCIAL : 23253652.127118643
WEATHER : 5074486.197183099
TOOLS : 10801391.298666667
FOOD_AND_DRINK : 1924897.7363636363
PRODUCTIVITY : 16787331.344927534
SPORTS : 3638640.1428571427
GAME : 15588015.603248259
HEALTH_AND_FITNESS : 4188821.9853479853
BUSINESS : 1712290.1474201474
LIBRARIES_AND_DEMO : 638503.73