# Analyzing Mobile App Data 

This project analyzes apps in the Apple App Store and the Google Play Store. The purpose is to find out what kind of app a company could build to turn a profit.

Goal: The goal of this project is to learn how to find patterns in data. This is my first project using Python, and it's a good way to learn the basics of the language. It is also my first dive into data science.

## Opening the Data


As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Collecting data for over four million apps requires a significant amount of time and money, so we'll analyze a sample of data instead. Fortunately, data for these two app stores is already available online:

[A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from this [link](https://uflorida-my.sharepoint.com/:x:/g/personal/gferioli_ufl_edu/EbU6sGESM89Ar84gShA4vJ4Bhs9dBBC4YOwiH_6uaH918g?rtime=Av1hNa6i10g).

[A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this [link](https://uflorida-my.sharepoint.com/:x:/g/personal/gferioli_ufl_edu/EV6QvMy17AxCiWiMFQULZaMBcmoTUSlAf8CToPyvU7Ysqg?e=kuneaE).

First we open the two data sets.

In [145]:
from csv import reader

### The Google Play data set ###
opened = open('googleplaystore.csv',encoding='utf8')
read = reader(opened)
android = list(read)
android_header = android[0]# The names of the columns
android = android[1:]# The data

### The App Store data set ###
opened = open('AppleStore.csv',encoding='utf8')
read = reader(opened)
ios = list(read)
ios_header = ios[0]# The names of the columns
ios = ios[1:]# The data
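
The two loading steps above follow the same pattern, so they could be wrapped in one helper. The sketch below is only an illustration: `load_csv` is a hypothetical name, and it reads a tiny stand-in file written on the fly so that it runs without the real data sets. It uses a `with` block so the file is closed automatically.

```python
import csv
import os
import tempfile

def load_csv(path):
    """Return (header, rows) for a CSV file; load_csv is a hypothetical helper."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, encoding='utf8', newline='') as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Tiny stand-in file so the sketch runs without the real data sets
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, 'sample.csv')
    with open(sample_path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows([['App', 'Rating'], ['Demo App', '4.1']])
    sample_header, sample_rows = load_csv(sample_path)

print(sample_header)
print(sample_rows)
```

With the real files, the call would look like `android_header, android = load_csv('googleplaystore.csv')`.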


Below we introduce a function, explore_data, which helps us explore the data. It prints a slice of rows and, optionally, the number of rows and columns in the data set.

In [146]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line between rows
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
print(android_header)
print('\n')
explore_data(android, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Above is a glance at some of the data in the Google Play Store. We see that the Google Play data set has 10841 apps and 13 columns.

At a quick glance, the columns that might be useful for the analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

Now let's look at the iOS App Store data set.

In [147]:
print(ios_header)
print('\n')
explore_data(ios, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


We have 7197 iOS apps and 16 columns in this data set. The columns that seem interesting are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. Some columns are not self-explanatory, but they are documented at this [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

## Data Cleaning

It's often said that data scientists spend around 80% of their time cleaning data, and only about 20% actually analyzing (cleaned) data. In this section we will clean our data by removing duplicate and inaccurate entries.

The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) outlines an error for row 10472.



In [148]:
print(android[10472])  # incorrect row
print('\n')
print(android_header)  # header
print('\n')
print(android[0])      # example of correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


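The incorrect row is missing its 'Category' value, so every later value is shifted one column to the left. As a result the 'Rating' column reads 19, which is impossible when the maximum rating is 5.0. Let's delete this row.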

In [149]:
print(len(android))
del android[10472]  # don't run this more than once
print(len(android))

10841
10840
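
The bad row was one value short of the header, which suggests a general check: flag every row whose length differs from the header's. The sketch below uses toy data and a hypothetical helper name, `find_misshaped_rows`; it is one possible way to look for such rows, not something the Kaggle discussion prescribes.

```python
def find_misshaped_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(dataset) if len(row) != len(header)]

toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Good App', 'TOOLS', '4.1'],
    ['Shifted App', '1.9'],        # 'Category' is missing, so later values shift left
    ['Another App', 'GAME', '3.9'],
]
print(find_misshaped_rows(toy_rows, toy_header))
```

Running the same check on the full data set before deleting anything would show whether other rows have the same problem.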


## Removing Duplicates

### Part 1 Google Play Store

If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:

In [150]:
print(android_header)
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Below we count the duplicate entries in the Google Play data set.

In [151]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
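
Checking `name in unique_apps` against a list gets slower as the list grows. As a rough alternative, the same count can be obtained with `collections.Counter`. The names below are toy values chosen for illustration.

```python
from collections import Counter

toy_names = ['Box', 'Slack', 'Box', 'Instagram', 'Box', 'Slack']

name_counts = Counter(toy_names)
# An app seen k times contributes k - 1 duplicate entries
n_duplicates = sum(count - 1 for count in name_counts.values())
print('Number of duplicate entries:', n_duplicates)
print('Number of unique apps:', len(name_counts))
```

On the real data, `toy_names` would be replaced by `[app[0] for app in android]`.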


Among the four Instagram entries, the main difference is the number of reviews. Rather than removing duplicates at random, we keep only the entry with the most reviews, since a higher review count suggests a more recent record and therefore more reliable data.

To do this, we will use a dictionary to find the highest review count for each app in the data set.

In [152]:
max_reviews = {}

for app in android:
    name = app[0]
    num_reviews = float(app[3])
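    if name in max_reviews and max_reviews[name] < num_reviews:
        max_reviews[name] = num_reviews
    elif name not in max_reviews:  # first time we see this app
        max_reviews[name] = num_reviews  # a plain else would overwrite a larger stored value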
        

To check that this worked, we compare the length of the dictionary with the number of rows minus the number of duplicates.

In [153]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(max_reviews))
if len(android) - 1181 == len(max_reviews):
    print("One step Closer!")

Expected length: 9659
Actual length: 9659
One step Closer!


Now that we know the highest review count for each app, we use it to remove the duplicate rows from the original data.

In [154]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    num_reviews = float(app[3])
    
    if (max_reviews[name] == num_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name) # make sure we dont include it twice
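
The two passes above (first find the maximum, then filter) can also be folded into one by storing the whole row in the dictionary. This is only a sketch with toy rows and a hypothetical function name, `keep_most_reviewed`. Note that its output follows the order in which each name first appears, which can differ from the order produced by the two-pass version.

```python
def keep_most_reviewed(dataset, name_index, reviews_index):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}
    for row in dataset:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict > means the first row wins a tie, like the already_added check
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows: name, category, rating, reviews
toy = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
for row in keep_most_reviewed(toy, 0, 3):
    print(row)
```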
        

Let's explore the clean data.

In [155]:
explore_data(android_clean, 0, 2, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13


The number of rows matches the length of max_reviews, which confirms that the duplicates were removed.

### Part 2 - iOS 

Now we repeat the same steps to find and remove the duplicates in the iOS data set.

In [156]:
duplicate_apps = []
unique_apps = []

for app in ios:
    name = app[1]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 2


Examples of duplicate apps: ['Mannequin Challenge', 'VR Roller Coaster']


In [157]:
reviews_max = {}

for app in ios:
    name = app[1]
    n_reviews = float(app[5])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [158]:

print('Expected length:', len(ios) - 2)
print('Actual length:', len(reviews_max))

Expected length: 7195
Actual length: 7195


In [159]:
ios_clean = []
already_add = []

for app in ios:
    name = app[1]
    n_reviews = float(app[5])
    
    if (reviews_max[name] == n_reviews) and (name not in already_add):
        ios_clean.append(app)
        already_add.append(name) 

In [160]:
explore_data(ios_clean, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7195
Number of columns: 16


We have finished cleaning all the duplicates from both data sets.

## Removing Non-English Apps

###  Part 1 - Google Play Store 

If you explore the data sets enough, you'll notice the names of some of the apps suggest they are not directed toward an English-speaking audience. Below, we see a couple of examples from both data sets:

In [161]:
print(ios[813][1])
print(ios[6731][1])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
DEM DZ


We are not interested in keeping non-English apps, so we remove them with the function English shown below.

The strategy relies on the [ASCII standard](https://www.computerhope.com/jargon/a/ascii.htm). ASCII covers code points 0 through 127, which include all the characters commonly used in English text. For our purposes, any character whose code point (as returned by ord()) is above 127 counts as non-English.

In [162]:
def English(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

print(English('Instagram'))
print(English('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


What about app names that contain emojis or other non-ASCII characters but are still English?

In [163]:
print(English('Docs To Go™ Free Office Suite'))
print(English('Instachat 😜'))

print(ord('™'))
print(ord('😜'))

False
False
8482
128540


To fix this, we update the English function so that it rejects a name only if it contains more than three non-ASCII characters. This way we avoid discarding a large number of English apps.

In [164]:
def English(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(English('Docs To Go™ Free Office Suite'))
print(English('Instachat 😜'))

True
True


This function is not perfect, but it's good enough for our purposes. A more precise filter would need to look at characters beyond the basic ASCII range in more detail.
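
For comparison, the same threshold rule can be written more compactly with a generator expression. The name `is_probably_english` and the `max_non_ascii` parameter are hypothetical; making the threshold a parameter is just one way to let us experiment with stricter or looser filters.

```python
def is_probably_english(name, max_non_ascii=3):
    """True when the name has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Instachat 😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_probably_english('Docs To Go™ Free Office Suite', max_non_ascii=0))
```

The last call shows that a threshold of zero behaves like the first, stricter version of English.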

Below we use the English function to filter out most non-English apps.

In [165]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if English(name):
        android_english.append(app)
        
for app in ios_clean:
    name = app[1]
    if English(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

Now that we have removed the unusable data, we end up with 9614 apps from Google Play and 6181 from the App Store.


## Categorizing Apps

In [166]:
android_free = []
android_cost = []
ios_free = []
ios_cost = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_free.append(app)
    else:
        android_cost.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)
    else:
        ios_cost.append(app)
        
print("Free android apps:",len(android_free))
print("Paid android apps:",len(android_cost))
print("Total android apps:", len(android_english))
print("Free + paid android apps:",len(android_free)+len(android_cost))
print("\n")
print("Free iOS apps:",len(ios_free))
print("Paid iOS apps:",len(ios_cost))
print("Total iOS apps:", len(ios_english))
print("Free + paid iOS apps:",len(ios_free)+len(ios_cost))

Free android apps: 8864
Paid android apps: 750
Total android apps: 9614
Free + paid android apps: 9614


Free iOS apps: 3220
Paid iOS apps: 2961
Total iOS apps: 6181
Free + paid iOS apps: 6181


As seen above, we separated the paid apps from the free apps and checked that the lengths of the two lists add up to the size of each data set.
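
The comparison above depends on the exact strings '0' and '0.0'. A slightly more forgiving check converts the price to a number first. This is only a sketch: `is_free` is a hypothetical helper, and it assumes that a currency symbol, if there is one, is a leading '$'.

```python
def is_free(price):
    """True when a price string such as '0', '0.0' or '$0.00' means zero."""
    # Assumption: a currency symbol, if present, is a leading '$'
    return float(price.strip().lstrip('$')) == 0.0

for price in ['0', '0.0', '$0.00', '$4.99', '2.99']:
    print(price, '->', is_free(price))
```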

## Most Common Apps by Genre

### Part 1 Popular Genres

So far, we have spent a good amount of time cleaning data, and:

    Removed inaccurate data
    Removed duplicate app entries
    Removed non-English apps
    Isolated the free and paid apps
    
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To determine the most common kinds of apps, we build a frequency table for the genre column using a dictionary, as seen below.

I also included a sort function because we will use it a lot to make the data more readable. It is good practice to write functions for repetitive tasks and let the computer do the work.

In [167]:
def sort(dataset):
    sorted_stuff = sorted(dataset.items(), key=lambda item: item[1], reverse=True)
    return sorted_stuff

def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)        
    table_sorted = sort(table)
    for entry in table_sorted:
        print(entry[0],':',round(entry[1],2),'%')
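
The frequency table above could also be built with `collections.Counter` from the standard library, whose `most_common()` method takes care of the sorting. The function name `freq_table_counter` and the toy rows are made up for this sketch.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() already yields (value, count) pairs in descending order
    return {value: count / total * 100 for value, count in counts.most_common()}

toy = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
for genre, pct in freq_table_counter(toy, 1).items():
    print(genre, ':', round(pct, 2), '%')
```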

In [168]:
display_table(android_free,1)
print("-------------------------------")
display_table(android_cost,1)


FAMILY : 19.22 %
GAME : 9.51 %
TOOLS : 8.46 %
BUSINESS : 4.58 %
LIFESTYLE : 3.9 %
PRODUCTIVITY : 3.89 %
FINANCE : 3.7 %
MEDICAL : 3.54 %
SPORTS : 3.42 %
PERSONALIZATION : 3.32 %
COMMUNICATION : 3.25 %
HEALTH_AND_FITNESS : 3.07 %
PHOTOGRAPHY : 2.94 %
NEWS_AND_MAGAZINES : 2.8 %
SOCIAL : 2.66 %
TRAVEL_AND_LOCAL : 2.34 %
SHOPPING : 2.25 %
BOOKS_AND_REFERENCE : 2.14 %
DATING : 1.86 %
VIDEO_PLAYERS : 1.78 %
MAPS_AND_NAVIGATION : 1.4 %
FOOD_AND_DRINK : 1.24 %
EDUCATION : 1.13 %
LIBRARIES_AND_DEMO : 0.94 %
AUTO_AND_VEHICLES : 0.93 %
ENTERTAINMENT : 0.88 %
HOUSE_AND_HOME : 0.82 %
WEATHER : 0.8 %
EVENTS : 0.71 %
PARENTING : 0.65 %
ART_AND_DESIGN : 0.64 %
COMICS : 0.62 %
BEAUTY : 0.6 %
-------------------------------
FAMILY : 24.27 %
GAME : 10.93 %
MEDICAL : 10.93 %
PERSONALIZATION : 10.8 %
TOOLS : 10.4 %
PRODUCTIVITY : 3.73 %
BOOKS_AND_REFERENCE : 3.73 %
COMMUNICATION : 3.6 %
SPORTS : 3.2 %
PHOTOGRAPHY : 2.53 %
LIFESTYLE : 2.4 %
FINANCE : 2.27 %
HEALTH_AND_FITNESS : 2.0 %
BUSINESS : 1.6 %
TRAVEL

In [169]:
print("FREE_IOS:")
print('\n')
display_table(ios_free,11)
print('\n')
print("COST_IOS:")
print('\n')
display_table(ios_cost,11)

FREE_IOS:


Games : 58.14 %
Entertainment : 7.89 %
Photo & Video : 4.97 %
Education : 3.66 %
Social Networking : 3.29 %
Shopping : 2.61 %
Utilities : 2.52 %
Sports : 2.14 %
Music : 2.05 %
Health & Fitness : 2.02 %
Productivity : 1.74 %
Lifestyle : 1.58 %
News : 1.34 %
Travel : 1.24 %
Finance : 1.12 %
Weather : 0.87 %
Food & Drink : 0.81 %
Reference : 0.56 %
Business : 0.53 %
Book : 0.43 %
Navigation : 0.19 %
Medical : 0.19 %
Catalogs : 0.12 %


COST_IOS:


Games : 51.27 %
Education : 9.86 %
Entertainment : 6.59 %
Photo & Video : 6.11 %
Utilities : 4.46 %
Productivity : 3.78 %
Health & Fitness : 3.38 %
Music : 2.4 %
Lifestyle : 1.62 %
Weather : 1.38 %
Book : 1.38 %
Business : 1.22 %
Reference : 1.18 %
Sports : 1.18 %
Navigation : 0.74 %
Travel : 0.68 %
Social Networking : 0.68 %
Food & Drink : 0.61 %
Medical : 0.51 %
News : 0.47 %
Finance : 0.44 %
Shopping : 0.03 %
Catalogs : 0.03 %


The top 5 genres in the App Store for free apps are: Games with 58.14%, Entertainment with 7.89%, Photo & Video with 4.97%, Education with 3.66%, and Social Networking with 3.29%.

The top 5 genres in the App Store for paid apps are: Games with 51.27%, Education with 9.86%, Entertainment with 6.59%, Photo & Video with 6.11%, and Utilities with 4.46%.

In the App Store, Games far outnumber every other genre. The App Store also seems to consist mostly of entertainment-style apps (games, entertainment, photo and video, social networking, sports, music, etc.).

This can be a good indication of the audience the App Store attracts, possibly mostly kids between the ages of 5 and 18. However, we would need more data than just the number of apps per genre to support this conclusion.

This still doesn't answer our first question of which apps are more likely to make a profit. All this data tells us is which types of apps are most numerous.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kind of apps that have the most users.

### Part 2 Downloads 

For this next part, we will find the average number of downloads/installations for each genre. This will help us determine which types of apps are being downloaded more often.

For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. However, we can take the total number of user ratings as a proxy for installations, which we can find in the rating_count_tot column.

In [170]:
Fand = freq_table(android_free,1)
Cand = freq_table(android_cost,1)
Fios = freq_table(ios_free,11)  # prime_genre is column 11 in the iOS data
Cios = freq_table(ios_cost,11)

def review_installs(dataset,index = -5):
    installs = {}
    sorted_installs={}
    dictionary = freq_table(dataset,index)
    for genre in dictionary:
        total = 0
        length = 0
        for app in dataset:
            genre_app = app[index]
            if genre_app == genre:
                ratings = float(app[5])
                total += ratings
                length+=1
        average = round(total/length,2)
        installs[genre] = average
    sorted_installs = sort(installs)
    
    for genre in sorted_installs:
        print(genre[0],":",genre[1])


review_installs(ios_free)     
print('----------------------------------')

review_installs(ios_cost)

Navigation : 86090.33
Reference : 74942.11
Social Networking : 71548.35
Music : 57326.53
Weather : 52279.89
Book : 39758.5
Food & Drink : 33333.92
Finance : 31467.94
Photo & Video : 28441.54
Travel : 28243.8
Shopping : 26919.69
Health & Fitness : 23298.02
Sports : 23008.9
Games : 22812.92
News : 21248.02
Productivity : 21028.41
Utilities : 18684.46
Lifestyle : 16485.76
Entertainment : 14029.83
Business : 7491.12
Education : 7003.98
Catalogs : 4004.0
Medical : 612.0
----------------------------------
Games : 6695.86
Business : 4043.47
News : 3872.36
Weather : 3248.41
Music : 2759.2
Shopping : 2722.0
Health & Fitness : 2679.85
Photo & Video : 2531.52
Reference : 2400.37
Productivity : 2247.93
Entertainment : 2131.51
Utilities : 1326.68
Catalogs : 1309.0
Navigation : 1174.59
Lifestyle : 902.77
Finance : 882.85
Medical : 663.73
Education : 640.97
Travel : 602.95
Food & Drink : 579.5
Social Networking : 393.0
Book : 320.41
Sports : 253.74


Above we wrote a function that calculates the average number of ratings for each genre. It reads the rating count from column 5 (rating_count_tot), so it should only be used for the iOS apps.
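
The function above loops over the whole data set once per genre. As a sketch of an alternative, the same averages can be collected in a single pass by keeping a running total and count per genre. The name `average_by_group` and the toy rows are illustrative only.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the value column}, visiting each row once."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    return {group: round(totals[group] / counts[group], 2) for group in totals}

# Toy rows: name, genre, number of ratings
toy = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Weather', '50'],
]
print(average_by_group(toy, 1, 2))
```

For the iOS data the call would be `average_by_group(ios_free, 11, 5)`, and the result could then be passed to sort.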

You may notice that, among paid apps, the genres with the most ratings on average are Games (6695), Business (4043), News (3872), Weather (3248), and Music (2759). So it seems that people are willing to pay for either entertainment or information relevant to their everyday lives. Maybe paying makes people feel they are getting a good, reliable product, but we can't be sure from this data alone.

In the free section, the top genres are Navigation, Reference, Social Networking, Music, and Weather. The pattern here is less clear, but social networking is near the top, which makes sense given how widespread social media has become. The surprising genres are Navigation and Reference.

In [171]:
for app in ios_free:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) # print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


Waze and Google Maps heavily skew the Navigation genre toward the top of the list. Their rating counts are so high because GPS navigation is a useful everyday tool: it helps people get to and from work faster by avoiding traffic, police, red lights, etc.
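
One way to see how strongly two apps pull the average is to compare the mean with the median, which is much less sensitive to a few very large values. The numbers below are the rating counts from the Navigation listing above.

```python
from statistics import mean, median

# Rating counts copied from the Navigation listing above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print('Mean:', round(mean(navigation_ratings), 2))
print('Median:', median(navigation_ratings))
```

The mean is about 86,090 while the median is 8,196.5, so a typical free navigation app has far fewer ratings than the average suggests.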



Reference apps have 74,942 user ratings on average, but it's actually the Bible and Dictionary.com which skew up the average number of ratings:

In [172]:
for app in ios_free:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


However, there is an interesting pattern here: many of the apps in Reference are guides for a game called "Minecraft". This could be an opportunity for a developer: make a gaming app, but also provide reference material on the game. Some examples could be how-to guides for the game, fiction set in the game, and tips and tricks.

The idea is to make one hit game and have all the other apps revolve around that one game.

Other genres that seem popular include weather, book, food and drink, or finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

    Weather apps: people generally don't spend too much time in-app, and the chances of making profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

    Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. So making a popular food and drink 
    app requires actual cooking and a delivery service, which is more than just a developer job.

    Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain 
    knowledge, and we don't want to hire a finance expert just to build an app.

Now let's analyze the Google Play market a bit.




For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [173]:
display_table(android_free, 5) # the Installs columns
print("--------------------")
display_table(android_cost, 5) # the Installs columns

1,000,000+ : 15.75 %
100,000+ : 11.56 %
10,000,000+ : 10.5 %
10,000+ : 10.21 %
1,000+ : 8.39 %
100+ : 6.92 %
5,000,000+ : 6.83 %
500,000+ : 5.56 %
50,000+ : 4.77 %
5,000+ : 4.51 %
10+ : 3.54 %
500+ : 3.25 %
50,000,000+ : 2.3 %
100,000,000+ : 2.13 %
50+ : 1.92 %
5+ : 0.79 %
1+ : 0.51 %
500,000,000+ : 0.27 %
1,000,000,000+ : 0.23 %
0+ : 0.05 %
0 : 0.01 %
--------------------
1,000+ : 18.13 %
10,000+ : 15.6 %
100+ : 12.13 %
100,000+ : 10.93 %
10+ : 9.33 %
5,000+ : 8.67 %
50,000+ : 5.33 %
500+ : 5.33 %
50+ : 4.53 %
1+ : 2.8 %
1,000,000+ : 2.67 %
5+ : 1.6 %
500,000+ : 1.47 %
0+ : 1.2 %
10,000,000+ : 0.27 %


One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes. We only want to get an idea of which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

To perform computations, however, we'll need to convert each install number to a float. That means removing the commas and plus signs from each string, which we do with the str.replace() method.
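
As a small worked example of that conversion, the two replacements can be chained and wrapped in a helper. `parse_installs` is a hypothetical name used only for this sketch.

```python
def parse_installs(installs):
    """Turn a string such as '1,000,000+' into the float 1000000.0."""
    return float(installs.replace(',', '').replace('+', ''))

for raw in ['0', '100+', '1,000+', '1,000,000,000+']:
    print(raw, '->', parse_installs(raw))
```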

In [174]:

def avg_installs(dataset):
    categories = freq_table(dataset, 1)
    downloads = {}
    for category in categories:
        total = 0
        len_category = 0
        for app in dataset:
            category_app = app[1]
            if category_app == category:            
                installs = app[5]
                installs = installs.replace(',', '')
                installs = installs.replace('+', '')
                total += float(installs)
                len_category += 1
        average = round(total / len_category, 2)  # renamed so it no longer shadows avg_installs
        downloads[category] = average
        
    sorted_downloads = sort(downloads)
    for entry in sorted_downloads:
        print(entry[0],':',entry[1])

        
avg_installs(android_free)
print('-------------------------------------')
avg_installs(android_cost)

COMMUNICATION : 38326063.2
VIDEO_PLAYERS : 24790074.18
SOCIAL : 23253652.13
PHOTOGRAPHY : 17840110.4
PRODUCTIVITY : 16772838.59
TRAVEL_AND_LOCAL : 13984077.71
GAME : 12914435.88
TOOLS : 10801391.3
NEWS_AND_MAGAZINES : 9549178.47
ENTERTAINMENT : 9146923.08
BOOKS_AND_REFERENCE : 8767811.89
SHOPPING : 7036877.31
PERSONALIZATION : 5201482.61
FAMILY : 5180161.79
WEATHER : 5074486.2
SPORTS : 4274688.72
HEALTH_AND_FITNESS : 4167457.36
MAPS_AND_NAVIGATION : 4056941.77
ART_AND_DESIGN : 1986335.09
FOOD_AND_DRINK : 1924897.74
EDUCATION : 1768500.0
BUSINESS : 1704192.34
LIFESTYLE : 1437816.27
FINANCE : 1387692.48
HOUSE_AND_HOME : 1331540.56
DATING : 854028.83
COMICS : 817657.27
AUTO_AND_VEHICLES : 647317.82
LIBRARIES_AND_DEMO : 638503.73
PARENTING : 542603.62
BEAUTY : 513151.89
EVENTS : 253542.22
MEDICAL : 123064.79
-------------------------------------
GAME : 256097.13
FAMILY : 116201.73
WEATHER : 101500.0
ENTERTAINMENT : 100000.0
PHOTOGRAPHY : 98881.05
LIFESTYLE : 65506.11
SPORTS : 51825.62
PROD

On average, communication apps have the most installs: 38,326,063. This number is heavily skewed up by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs:

In [175]:

for app in android_free:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

Messenger – Text and Video Chat for Free : 1,000,000,000+
Gmail : 1,000,000,000+
imo beta free calls and text : 100,000,000+
imo free video calls and chat : 500,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
WhatsApp Messenger : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Hangouts : 1,000,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

We see the same pattern for the video players category (24,727,872 installs on average). The market is dominated by apps like YouTube, Google Play Movies & TV, and MX Player. The pattern repeats for social apps (with giants like Facebook, Instagram, and Google+), photography apps (Google Photos and other popular photo editors), and productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).
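One way to gauge how much the giants inflate a category average is to recompute it while leaving out the apps at or above some install threshold. The sketch below is only an illustration: `parse_installs`, `average_installs`, the `cap` parameter, and the `sample` rows are hypothetical names and made-up data, laid out like `android_free` (name at index 0, category at index 1, installs at index 5).

```python
def parse_installs(installs):
    """Turn a string such as '1,000,000+' into the integer 1000000."""
    return int(installs.replace(',', '').replace('+', ''))


def average_installs(apps, category, cap=None):
    """Average installs for one category.

    If cap is given, apps with cap or more installs are left out.
    """
    counts = []
    for app in apps:
        n_installs = parse_installs(app[5])
        if app[1] == category and (cap is None or n_installs < cap):
            counts.append(n_installs)
    return sum(counts) / len(counts) if counts else 0.0


# Hypothetical rows, NOT taken from the real data set.
sample = [
    ['Big Chat', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '', '', '', '1,000,000+'],
    ['Tiny Chat', 'COMMUNICATION', '', '', '', '500,000+'],
    ['Some Player', 'VIDEO_PLAYERS', '', '', '', '10,000,000+'],
]

# Average with every app, then without apps at 100 million installs or more.
print(average_installs(sample, 'COMMUNICATION'))
print(average_installs(sample, 'COMMUNICATION', cap=100_000_000))
```

Calling `average_installs(android_free, 'COMMUNICATION', cap=100_000_000)` on the real data would show how far the average drops once the most installed apps are set aside.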

The books and reference genre (8,767,811 installs on average) looks fairly popular as well. This genre is worth exploring in more depth, since we found it has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

Let's take a look at some of the apps from this genre and their number of installs:

In [176]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
E

We also notice quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. Taking a popular book (perhaps a more recent one) and turning it into an app could be profitable for both the Google Play and App Store markets.

However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.
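The observation about apps built around a single book can be checked with a simple keyword search over app names. This is a minimal sketch under stated assumptions: `apps_with_keyword` and the `sample_books` rows are hypothetical and made up for illustration, with the same column layout as `android_free` (name at index 0, category at index 1, installs at index 5).

```python
def apps_with_keyword(apps, category, keyword):
    """Return (name, installs) pairs for apps in a category whose name
    contains the keyword, ignoring case."""
    keyword = keyword.lower()
    matches = []
    for app in apps:
        if app[1] == category and keyword in app[0].lower():
            matches.append((app[0], app[5]))
    return matches


# Hypothetical rows, NOT taken from the real data set.
sample_books = [
    ['Example Quran Reader', 'BOOKS_AND_REFERENCE', '', '', '', '1,000,000+'],
    ['Example Dictionary', 'BOOKS_AND_REFERENCE', '', '', '', '500,000+'],
    ['Example quran audio', 'BOOKS_AND_REFERENCE', '', '', '', '100,000+'],
    ['Quran Quiz Game', 'GAME', '', '', '', '50,000+'],
]

for name, installs in apps_with_keyword(sample_books, 'BOOKS_AND_REFERENCE', 'quran'):
    print(name, ':', installs)
```

Running the same search on `android_free` with other book titles would give a rough sense of how crowded the market around a given book already is.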

## Conclusions

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

There were multiple conclusions:

    1. A communication app good enough to replace all the others would definitely become profitable over time. However, 
    this seems extremely unlikely unless substantial resources are available to build the app.
    2. Many people are willing to pay for productivity apps. A good, well-rounded productivity app offered at a 
    reasonable price could make a good profit. 
    3. Make a game. This involves not only making the game itself, but also building other apps around it to make the 
    game more popular. Doing this can increase profits as well as advertising. 
    4. Make a book app for best-selling books. This app would need to offer a little more than the average book in the 
    libraries, though: for example, daily quotes from the book, an audio version of the book, quizzes on the book, or a 
    forum where people can discuss the book.
    
All in all, making any type of app requires a good amount of resources. Expect profit to come only after a long time, because starting from scratch makes app development very difficult.