# Free to Download App Analysis

This project is about free to download mobile apps and the goal of this project is to identify which type of apps are likely to attract the most users. Using that information developers can create popular apps that generate revenue through in-app advertisements.

There are two free datasets that we will be using for our analysis.

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).

- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

# Exploration

Let's start by getting to understand the data that we're working with. We'll need to create functions to help open and explore the data.

In [1]:
# Function for opening, reading and listing a csv
def open_read_list(file_name):
    opened_file = open(file_name)
    from csv import reader
    read_file = reader(opened_file)
    app_data = list(read_file)
    return app_data

#Function for exploring data
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Next, lets splice our data and save them as variables that are easy to reference.

In [2]:
# Saving datasets to variables we can reference
ios = open_read_list('AppleStore.csv')
android = open_read_list('googleplaystore.csv')

#ios variables
ios_header = ios[0]
ios_data = ios[1:]

#android variables
android_header = android[0]
android_data = android[1:]

Now that we have our functions and variables set up, let's explore the datasets starting with the ios apps.

In [3]:
#exploring ios apps
print(ios_header)
print('\n')
explore_data(ios_data, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


We can see that there are 7197 apps in the ios dataset and 16 different columns. Of the columns, Size, Price, User Ratings, Content Rating, and Prime Genre looks like they will be useful for our analysis. A description of the ios columns can be found in the the columns section of this [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

Next, lets explore the android apps.

In [4]:
#exploring andoid apps
print(android_header)
print('\n')
explore_data(android_data, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


The android dataset contains 10,841 apps and 13 different columns. At first glance, Category, Rating, Reviews, Size, Installs, Price, Content Ratings, and Genres will be useful in our analysis. A description of the columns can be found in the columns section of this [link](https://www.kaggle.com/lava18/google-play-store-apps).

# Data Cleaning

Now that we have a sense of the data we're working with we need to clean it. Without cleaning, the results of our analysis will be wrong. Data Cleaning involves:

- Removing or correction wrong data
- Removing duplicate data
- And modifying the data to fit the purpose of our analysis

## Incorrect Data

Some datasets have a discussion section. We can start there to see if anyone else has found an error in the data. For the andoid data [one discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) describes missing data in row or index 10472. Let's verify it.

### Incorrect Data - android

In [5]:
x = 10472
print(android_data[x])
print('\n')
print('row length: ',len(android_data[x]))
print('\n')
print('header length: ',len(android_header))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


row length:  12


header length:  13


We can see that index 10472 (row 10473) is missing 1 piece of data in comparison by comparing the length to the length of the header.

Using that same logic, let's check the whole dataset for rows with lengths not equal to the length of the header.

In [6]:
counter = -1
print('Header Length: ',len(android_header))
print('\n')

for row in android_data:
    counter += 1
    if len(android_header) != len(row):
        print('Index #: ', counter)
        print('Row Length: ',len(row))
        print(row)
        print('\n')

Header Length:  13


Index #:  10472
Row Length:  12
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




We've verfied that the the only row missing data in the android dataset is index 10472. Let's delete it from our dataset using the del function. We need to make sure we don't run this statement more than once, otherwise we'll delete more than one row.

In [7]:
# deleting row in android dataset with missing data
del android_data[10472]

Let's run our check again to see if we have missing data.

In [8]:
counter = -1
print('Header Length: ',len(android_header))
print('\n')

for row in android_data:
    counter += 1
    if len(android_header) != len(row):
        print('Index #: ', counter)
        print('Row Length: ',len(row))
        print(row)
        print('\n')

Header Length:  13




Nothing show's up, this means that we've successfully removed the row with missing data. Let's do the same with the ios dataset.

### Incorrect Data - ios

In [9]:
counter = -1
print('Header Length: ',len(ios_header))
print('\n')

for row in ios_data:
    counter += 1
    if len(ios_header) != len(row):
        print('Index #: ', counter)
        print('Row Length: ',len(row))
        print(row)
        print('\n')

Header Length:  16




Looks like the ios dataset doesn't have any rows with missing or extra data. Let's move on to the next step.

## Duplicate Data

The discussions in the ios dataset mention a possibility of duplicate data. We'll explore both the ios and android datasets to identify and remove any duplicates.

To identify duplicates, we'll loop through the name column of each dataset and separate the unique and duplicate apps into their own lists. Let's start with the ios dataset.

### Duplicate Data - ios

In [10]:
unique_apps = []
duplicate_apps = []

for row in ios_data:
    name = row[1]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

duplicate_apps

['Mannequin Challenge', 'VR Roller Coaster']

We can see that `Mannequin Challenge` and `VR Roller Coaster` have duplicates. Let's show the rows that have these names.

In [11]:
for row in ios_data:
    name = row[1]
    if name == 'Mannequin Challenge' or name == 'VR Roller Coaster':
        print(row)

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


It looks like we only have two duplicates in the ios dataset but let's verify by using the `len()` function on `duplicate_apps`.

In [12]:
print(len(duplicate_apps))

2


We'll need to remove these duplicates and come up with a criteria to determine which one to remove. For now, let's run through the same exercise with the android dataset.

### Duplicate Data - android

In [13]:
unique_apps = []
duplicate_apps = []

for row in android_data:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('duplicates:',len(duplicate_apps))
print('uniques:',len(unique_apps))

duplicates: 1181
uniques: 9659


We can see that the android dataset has a significantly higher count of duplicates than ios. Let's get to correcting these!

### Deleting Duplicates - android

For duplicates, were only going to keep the row with the highest number of reviews because it is a good proxy for being the most recent entry.

We'll loop through the data and create a dictionary of the app names and the highest number of reviews for that app. If done correctly we should only have 9,659 rows.

In [14]:
#create dictionary to store app names and the highest number of reviews
reviews_max = {}

for row in android_data:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9659


In the next step, we'll loop through our dataset and keep the most recent entry for each app based on number of reviews. To do so we'll create a list to hold the data we want to capture and a list of apps already added accounted for. The second list helps exclude entering apps with multiple entries with the same highest number of reviews. We'll check to see if we have 9,659 rows again.

In [15]:
#create two empty lists
android_clean = []
already_added = []

for row in android_data:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
print(len(android_clean))

9659


Let's do the same for the ios apps.

### Deleting Duplicates - ios

Let's borrow our code from above and modify it for the ios dataset. There are a total 7,197 apps in our dataset and we expect to remove 2 duplicates meaning we should have 7,195 in our cleaned dataset.

In [16]:
#create dictionary to store app names and the highest number of reviews
reviews_max = {}

for row in ios_data:
    name = row[1]
    n_reviews = float(row[5])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

#create two empty lists
ios_clean = []
already_added = []

for row in ios_data:
    name = row[1]
    n_reviews = float(row[5])
    if n_reviews == reviews_max[name] and name not in already_added:
        ios_clean.append(row)
        already_added.append(name)
        
print(len(ios_clean))

7195


After cleaning the data of incorrect and duplicate entries, we want to modify the dataset for our analysis. We want to analyze data that will help us build apps for an english speaking audience. Our dataset may contain apps that our catered towards non-english speaking audiences. Let's identify and remove those from out dataset.

## Modifying Dataset - English-speaking Apps

### Modifying Dataset - English-speaking - android

Each character has a corresponding code that can be identified using the built-in `ord()` function. Characters that are typically used in English text fall in the range of 0-127 according to the  [ASCII](https://en.wikipedia.org/wiki/ASCII) (American Standard Code for Information Interchange) system. 

Let's create a function that can identify apps with non-english names. We'll start with a function that can identify non-english characters within a string and then nest it in a function that loops through apps in a given dataset.

In [17]:
# create function
def eng_verify(string):
    count = 0
    for character in string:
        if ord(character) > 127:
            count += 1
    if count > 3:
        return False
    else:
        return True

#test function
print(eng_verify('Instagram'))
print(eng_verify('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(eng_verify('Docs To Go™ Free Office Suite'))
print(eng_verify('Instachat 😜'))

True
False
True
True


In the function above, check the characters of a the string to see if they fall within our desired code range if 0-127. Some english-speaking apps use emojis and special characters fall outside of that range, so ww will make an exception for apps that have 3 or less characters that fall outside the range. This isn't perfect but it's more effective.

Now let's nest this function within a loop to organize our android dataset into 2 lists: english apps and non-english apps.

In [18]:
#create 2 empty list, 1 to store desired data, 1 for data we dont want
android_clean_eng = []
android_clean_non = []

for app in android_clean:
    name = app[0]
    if eng_verify(name):
        android_clean_eng.append(app)
    else:
        android_clean_non.append(app)

explore_data(android_clean_eng,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


In the android dataset we are left qith 9,614 apps after removing the non-english apps. Let's explore the non-english list to make sure that we filtered out the apps we wanted.

In [19]:
explore_data(android_clean_non,0,3,True)

['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up']


['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up']


['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News & Magazines', 'August 6, 2018', '4.0.6', '4.4 and up']


Number of rows: 45
Number of columns: 13


From the examples above we can see that the names of the apps we filtered out have more than 3 non-english characters. This is what we intended, so we can now move on to the ios dataset.

### Modifying Dataset - English-speaking - ios

Lets run the same loop and function on the ios and see what we get.

In [20]:
ios_clean_eng = []
ios_clean_non = []

for app in ios_clean:
    name = app[1]
    if eng_verify(name):
        ios_clean_eng.append(app)
    else:
        ios_clean_non.append(app)

explore_data(ios_clean_eng,0,3,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6181
Number of columns: 16


Prior to this step, the ios dataset had 7,195 apps. After cleaning for non-english apps we are left with 6,181 apps. Let's check the non-english ios apps to be thorough.

In [21]:
explore_data(ios_clean_non,0,3,True)

['445375097', '爱奇艺PPS -《欢乐颂2》电视剧热播', '224617472', 'USD', '0.0', '14844', '0', '4.0', '0.0', '6.3.3', '17+', 'Entertainment', '38', '5', '3', '1']


['405667771', '聚力视频HD-人民的名义,跨界歌王全网热播', '90725376', 'USD', '0.0', '7446', '8', '4.0', '4.5', '5.0.8', '12+', 'Entertainment', '24', '4', '1', '1']


['336141475', '优酷视频', '204959744', 'USD', '0.0', '4885', '0', '3.5', '0.0', '6.7.0', '12+', 'Entertainment', '38', '0', '2', '1']


Number of rows: 1014
Number of columns: 16


It looks like code above did what we wanted, so let's continue to the next step of modifying our dataset.

## Modifying Dataset - Free Apps

In the intro we mentioned the goal of the project was to analyze apps that are free to download. Our final cleaning step will be to isolate only the free apps for our analysis.

### Modifying Dataset - Free Apps - Android

There are two columns that we can use to determine which apps are free. The `Type` and `Price` columns have information that can help us determine whether or not apps are free. We will only include apps that both have a 'Free' type and Price of 0.0 in our `android_free` dataset.

In [22]:
# create our empty lists
android_free = []
android_nonfree = []

# loop through dataset and isolate free apps based on type and price
for app in android_clean_eng:
    pay_type = app[6]
    if pay_type == 'Free':
        price = float(app[7])
        if price == 0.0:
            android_free.append(app)
    else:
        android_nonfree.append(app)

# explore lists
explore_data(android_free,0,2,True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 8863
Number of columns: 13


After isolating apps with both a 'Free' `Type` and 0.0 for `Price`, we're left with 8,863 apps. Now let's move on to isolating the free apps in the ios dataset.

### Modifying Dataset - Free Apps - ios

In our ios dataset we only have the price column to indentify free vs. non-free apps.

In [23]:
#create our lists
ios_free = []
ios_nonfree = []

for app in ios_clean_eng:
    price = float(app[4])
    if price == 0.0:
        ios_free.append(app)
    else:
        ios_nonfree.append(app)
        
explore_data(ios_free,0,2,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 3220
Number of columns: 16


After isolating apps with a `price` of 0.0 we are left with 3,220 free apps in the ios dataset. 

To summarize, these are the following size of our datasets we'll use to analyze for our project.

| Dataset | Apps |
| --- | --- |
| android | 8,863 |
| ios | 3,220 |

As a final step in our data cleaning, we'll assign the datasets a final dataset variable for the sake of auditability.

In [24]:
ios_final = ios_free
android_final = android_free

# Analysis

Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

Let's start by exploring our data again to see what columns will be useful in this portion of our analysis.

## Exploration

In [25]:
print(ios_header)
print(ios_final[:1])

print('\n')

print(android_header)
print(android_final[:1])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']]


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']]


From the ios dataset, `prime_genre` (index 11) will be useful and in the android dataset we can use both `Category` and `Genres`. Our next step will be to create dictionaries and determine the percentages of the genres for each dataset.

## Analysis Functions

In [26]:
# create a function to create a frequency table
def freq_table(dataset, index):
    groups = {}
    total_apps = 0
    check = 0
    for app in dataset:
        datapoint = app[index]
        total_apps += 1
        if datapoint in groups:
            groups[datapoint] += 1
        else:
            groups[datapoint] = 1
            
    for group in groups:
        # turn into % decimal format
        groups[group] /= total_apps
        # turn into % percentage format
        groups[group] *= 100
    return groups

Let's check our function to make sure it works the way we expect it to. We'll need to verify that the percentages sum to 100%.

In [27]:
ios_prime_genre = freq_table(ios_final, 11)
android_category = freq_table(android_final, 1)
android_genres = freq_table(android_final, 9)


def check_percentage(table):
    table_check = 0
    for row in table:
        table_check += table[row]
    print(table_check)
    print('\n')

check_percentage(ios_prime_genre)
check_percentage(android_category)
check_percentage(android_genres)

99.99999999999999


100.00000000000003


100.0000000000001




The sum of all the percentages add up to 100% disregarding minor calculation limitations. This function is good to go. The next step is to display it in an order that makes sense. We'll insert our `freq_table` function inside our new `display_table` function to sort our data in descending order.

In [28]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

Let's plug our dataset into the new function to generate our sorted tables. We'll start by analyzing the free english apps from the app store.

## Analysis - % of IOS genres

In [29]:
ios_prime_genre_sorted = display_table(ios_final, 11)

Games : 58.13664596273293
Entertainment : 7.888198757763975
Photo & Video : 4.968944099378882
Education : 3.6645962732919255
Social Networking : 3.291925465838509
Shopping : 2.608695652173913
Utilities : 2.515527950310559
Sports : 2.142857142857143
Music : 2.049689440993789
Health & Fitness : 2.018633540372671
Productivity : 1.7391304347826086
Lifestyle : 1.5838509316770186
News : 1.3354037267080745
Travel : 1.2422360248447204
Finance : 1.1180124223602486
Weather : 0.8695652173913043
Food & Drink : 0.8074534161490683
Reference : 0.5590062111801243
Business : 0.5279503105590062
Book : 0.43478260869565216
Navigation : 0.18633540372670807
Medical : 0.18633540372670807
Catalogs : 0.12422360248447205


The most common genre of apps on the App Store is Games followed by Entertainment. Most of the free english apps on the App Store seem to be catered for leisure use (games, photo and video, social networking, sports, music).

Although games are the most popular genre of apps, this doesn't necessarily mean they will have the most number of users. There are some assumptions we would need to make if we wanted to reach that conclusion. For now, let's move on and analyze the android `Category` and `Genres` data.

## Analysis - % of Android Cateegories and Genres

### Category

In [30]:
android_category_sorted = display_table(android_final, 1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

### Genres

In [31]:
android_genre_sorted = display_table(android_final, 9)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

The frequency tables we analyzed above showed us that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps. Now, we'd like to get an idea about the kind of apps with the most users.

## Analysis - # of Users

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` app.

### # of Users - IOS

In [32]:
# number of apps in a genre for avg calulation
genres_ios = freq_table(ios_final, 11)
        
# loop through dataset to count numnbe
for genre in genres_ios:
    total = 0
    len_genre = 0
    for apps in ios_final:
        genre_app = apps[11]
        if genre_app == genre:
            n_user_rating = float(apps[5])
            total += n_user_rating
            len_genre += 1
    avg_n_users = total / len_genre
    print(genre, ":", avg_n_users)

Entertainment : 14029.830708661417
Music : 57326.530303030304
Medical : 612.0
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Catalogs : 4004.0
Weather : 52279.892857142855
Photo & Video : 28441.54375
Games : 22812.92467948718
Book : 39758.5
Reference : 74942.11111111111
Food & Drink : 33333.92307692308
Navigation : 86090.33333333333
Utilities : 18684.456790123455
Social Networking : 71548.34905660378
Business : 7491.117647058823
Productivity : 21028.410714285714
Education : 7003.983050847458
News : 21248.023255813954
Shopping : 26919.690476190477
Lifestyle : 16485.764705882353
Finance : 31467.944444444445
Travel : 28243.8


From the results above, there are five genres that have more than 50,000 average users per app with Navigation leading with 86,090 avg users per app. I've added them below for easier viewing.

| Genres | Avg Users |
| --- | --- |
| Navigation | 86,090 |
| Reference | 74,942 |
| Social Networking | 71,548 |
| Music | 57,326 |
| Weather | 52,279|

From my initial inspection, I'd recommend building a **Weather** app because it seems less costly and requires less time to build. **Reference** may also be a good option but seems vague. Let's investigate both further.


In [33]:
# look through ios_final and show number of users for Weather apps
weather_apps = {}
for apps in ios_final:
    genre = apps[11]
    app_name = apps[1]
    users = apps[5]
    if genre == "Weather":
        weather_apps[app_name] = users

weather_apps

{'AccuWeather - Weather for Life': '144214',
 'Almanac Long-Range Weather Forecast': '12',
 'FEMA': '128',
 'Forecast Bar': '375',
 "Freddy the Frogcaster's Weather Station": '14',
 'Hurricane Tracker WESH 2 Orlando, Central Florida': '203',
 'Hurricane by American Red Cross': '1158',
 'JaxReady': '22',
 'Moji Weather - Free Weather Forecast': '2333',
 'MyRadar NOAA Weather Radar Forecast': '150158',
 'Météo-France': '24',
 'NOAA Weather Radar - Weather Forecast & HD Radar': '45696',
 'QuakeFeed Earthquake Map, Alerts, and News': '6081',
 'Storm Radar': '22792',
 'The Weather Channel App for iPad – best local forecast, radar map, and storm tracking': '208648',
 'The Weather Channel: Forecast, Radar & Alerts': '495626',
 'TodayAir': '0',
 'WRAL Weather Alert': '25',
 'WarnWetter': '0',
 'Weather & Radar': '37',
 'Weather - Radar - Storm with Morecast App': '78',
 'Weather Live Free - Weather Forecast & Alerts': '35702',
 'Weather Underground: Custom Forecast & Local Radar': '49192',
 'W

In terms of users, **Weather** apps seem like a hit or miss. It's a possibility, however, let's look into **Reference** apps as well.

In [34]:
# look through ios_final and show number of users for Reference apps
reference_apps = {}
for apps in ios_final:
    genre = apps[11]
    app_name = apps[1]
    users = apps[5]
    if genre == "Reference":
        reference_apps[app_name] = users

reference_apps

{'Bible': '985920',
 'City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE)': '8535',
 'Dictionary.com Dictionary & Thesaurus': '200047',
 'Dictionary.com Dictionary & Thesaurus for iPad': '54175',
 'GUNS MODS for Minecraft PC Edition - Mods Tools': '1497',
 'Google Translate': '26786',
 'Guides for Pokémon GO - Pokemon GO News and Cheats': '826',
 'Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free': '718',
 'Jishokun-Japanese English Dictionary & Translator': '0',
 'LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools': '4693',
 'Merriam-Webster Dictionary': '16849',
 'Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran': '18418',
 'New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition': '17588',
 'Night Sky': '12122',
 'Real Bike Traffic Rider Virtual Reality Glasses': '8',
 'VPN Express': '14',
 'WWDC': '762',
 '教えて!goo': '0'}

It seems like the Reference genre includes dictionaries, translators, and similar reference apps. They also seem to be hit or miss like the Weather apps, however, I believe these may be a better choice to build. Reference apps will not need to be updated as frequently as Weather apps. Let's check out android apps now.

### # of Users - Android

After a quick inspection of the installs column, most of the values are open-ended (100+, 1,000+, 5,000+, etc.) We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from string to float. This means we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error.

Now let's calculate the average number of installs per app genre for the Google Play data set.

In [35]:
#Generate table for android categories
android_category = freq_table(android_final, 1)

#Generate empty table to track avg users
android_category_users = {}

#loop through table and dataset to track avg users
for category in android_category:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            installs_string = app[5]
            installs_string = installs_string.replace('+','')
            installs_string = installs_string.replace(',','')
            installs = float(installs_string)
            total += installs
            len_category += 1
    avg_installs = total / len_category
    android_category_users[category] = avg_installs

#Function to sort data in descending order
def display_table(table):
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

#display top avg users for categories
display_table(android_category_users)

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3697848.1731343283
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

We can see that the the most popular categories are COMMUNICATION and VIDEO PLAYERS, however, these are probably heavily skewed by the a handful of apps made by giant companies. Additionally, we're looking for apps that we can recommend to develop on both Android & ios. BOOKS_AND_REFERENCE as well as WEATHER come in pretty high at 8.8M and 5.1M average users, respectively. Let's take a further look into these.

### Android - Books & Referene Apps

In [39]:
# look through ios_final and show number of users for Reference apps
books_and_reference_apps = {}
for apps in android_final:
    category = apps[1]
    app_name = apps[0]
    if category == "BOOKS_AND_REFERENCE":
        installs_string = app[5]
        installs_string = installs_string.replace('+','')
        installs_string = installs_string.replace(',','')
        installs = float(installs_string)
        books_and_reference_apps[app_name] = installs

books_and_reference_apps

{'50000 Free eBooks & Free AudioBooks': 10000000.0,
 'A-J Media Vault': 10000000.0,
 'AC Air condition Troubleshoot,Repair,Maintenance': 10000000.0,
 'AE Bulletins': 10000000.0,
 'AP Stamps and Registration': 10000000.0,
 'AW Tozer Devotionals - Daily': 10000000.0,
 'AY Sing': 10000000.0,
 'Aab e Hayat Full Novel': 10000000.0,
 'Ae Allah na Dai (Rasa)': 10000000.0,
 'Ag PhD Deficiencies': 10000000.0,
 'Ag PhD Field Guide': 10000000.0,
 'Ag PhD Planting Population Calculator': 10000000.0,
 'Ag PhD Soybean Diseases': 10000000.0,
 'Al Quran (Tafsir & by Word)': 10000000.0,
 'Al Quran : EAlim - Translations & MP3 Offline': 10000000.0,
 'Al Quran Al karim': 10000000.0,
 'Al Quran Indonesia': 10000000.0,
 "Al'Quran Bahasa Indonesia": 10000000.0,
 'Al-Muhaffiz': 10000000.0,
 'Al-Quran (Free)': 10000000.0,
 'Al-Quran 30 Juz free copies': 10000000.0,
 'AlReader -any text book reader': 10000000.0,
 'Aldiko Book Reader': 10000000.0,
 'All Language Translator Free': 10000000.0,
 'All Maths Formula

Above we can see that there are a wide variety of apps in the BOOKS_AND_REFERENCE category that each have more than 1M installs. Let's examine WEATHER as well.

### Android - Books & Referene Apps

In [40]:
# look through ios_final and show number of users for Reference apps
weather_android_apps = {}
for apps in android_final:
    category = apps[1]
    app_name = apps[0]
    if category == "WEATHER":
        installs_string = app[5]
        installs_string = installs_string.replace('+','')
        installs_string = installs_string.replace(',','')
        installs = float(installs_string)
        weather_android_apps[app_name] = installs

weather_android_apps

{"AEMET's time": 10000000.0,
 'APE Weather ( Live Forecast)': 10000000.0,
 'AccuWeather: Daily Forecast & Live Weather Reports': 10000000.0,
 'Ag Weather Tools': 10000000.0,
 'Amber Weather': 10000000.0,
 'Au Weather Free': 10000000.0,
 'ByssWeather for Wear OS': 10000000.0,
 'Clearwater, FL - weather and more': 10000000.0,
 'Climatempo Lite - 15 day weather forecast': 10000000.0,
 'DS Barometer - Altimeter and Weather Information': 10000000.0,
 'DS Thermometer': 10000000.0,
 'EZ Clock & Weather Widget': 10000000.0,
 'FR Tides': 10000000.0,
 'Florida Storms': 10000000.0,
 'ForecaWeather': 10000000.0,
 'Free live weather on screen': 10000000.0,
 'Fu*** Weather (Funny Weather)': 10000000.0,
 'GO Weather - Widget, Theme, Wallpaper, Efficient': 10000000.0,
 'GO Weather EX Theme White': 10000000.0,
 'HTC Weather': 10000000.0,
 'HumorCast - Authentic Weather': 10000000.0,
 'Info BMKG': 10000000.0,
 'Klara weather': 10000000.0,
 "Klart.se - Sweden's best weather": 10000000.0,
 'Live Weather &

Here we can see a similar result. Many of the individual apps that fall under the WEATHER category have over 1M installs.

Reference and Weather apps both seem to be good options to recommend for building on both the android and ios platforms. For Reference apps, religious books such as the Quaran and Bible are popular so perhaps building an app based on another religous book can generate large numbers of users. As for Weather, there seems to be weather apps that cover a large area and apps that cover specific regions or types of weather (such as storms). Perhaps building a weather app for a specific region with high population or specific weather not already covered on the android app store could generate large number of users.

# Conclusion

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a religious book and turning it into an app could be profitable for both the Google Play and the App Store markets. Additionally, creating a weather app based on a densley populated region or a weather app based on a specific type of weather can also be profitable.