# User App Analysis

Our company only builds apps that are free to download and install, and our main source of revenue is in-app ads.  Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

Two existing data sets are available for Android and iOS apps.  
- [A data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately 10,000 Android apps from Google Play.  The data was collected in August 2018.
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately 7,000 iOS apps from the App Store.  The data was collected in July 2017.

Useful columns from App Store data:
- track_name
- price (only interested in free apps)
- prime_genre
- rating_count_tot (how many ratings)

Useful columns from Play Store data:
- App
- Category
- Installs
- Price (only the free apps)
- Genres

In [1]:
import pandas as pd
ios = pd.read_csv('AppleStore.csv')
android = pd.read_csv('googleplaystore.csv')

In [2]:
ios.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


In [3]:
ios.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7197 entries, 0 to 7196
Data columns (total 16 columns):
id                  7197 non-null int64
track_name          7197 non-null object
size_bytes          7197 non-null int64
currency            7197 non-null object
price               7197 non-null float64
rating_count_tot    7197 non-null int64
rating_count_ver    7197 non-null int64
user_rating         7197 non-null float64
user_rating_ver     7197 non-null float64
ver                 7197 non-null object
cont_rating         7197 non-null object
prime_genre         7197 non-null object
sup_devices.num     7197 non-null int64
ipadSc_urls.num     7197 non-null int64
lang.num            7197 non-null int64
vpp_lic             7197 non-null int64
dtypes: float64(3), int64(8), object(5)
memory usage: 899.7+ KB


In [4]:
android.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [5]:
android.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


# Data exploration and clean-up
From the discussion on Kaggle for the Android dataset, we know there is a row of data that has an error.  Considering that pandas read columns such as `Reviews` as object when they should have been int or float, this row should be deleted.  The row appears to be around index 10473.

In [6]:
android.iloc[10471:10475]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10471,Xposed Wi-Fi-Pwd,PERSONALIZATION,3.5,1042,404k,"100,000+",Free,0,Everyone,Personalization,"August 5, 2014",3.0.0,4.0.3 and up
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,
10473,osmino Wi-Fi: free WiFi,TOOLS,4.2,134203,4.1M,"10,000,000+",Free,0,Everyone,Tools,"August 7, 2018",6.06.14,4.4 and up
10474,Sat-Fi Voice,COMMUNICATION,3.4,37,14M,"1,000+",Free,0,Everyone,Communication,"November 21, 2014",2.2.1.5,2.2 and up


In [7]:
android.drop(10472, inplace=True)
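A malformed row like this can also be located without relying on forum hints. The sketch below is only an illustration on a small made-up frame (the `toy` data and the `bad_rows` name are assumptions, not part of the notebook): it coerces `Reviews` to numbers and reports the rows that fail to parse.

```python
import pandas as pd

# Toy frame standing in for the Play Store data (illustrative values only)
toy = pd.DataFrame({
    'App': ['Good app', 'Shifted row', 'Another app'],
    'Reviews': ['1042', '3.0M', '37'],
})

# Coerce to numbers; entries that cannot be parsed become NaN
parsed = pd.to_numeric(toy['Reviews'], errors='coerce')

# Index labels of the rows whose Reviews value is not numeric
bad_rows = toy.index[parsed.isna()].tolist()
print(bad_rows)  # [1]
```

Applied to the real `android` frame before the drop, the same check would be expected to point at the shifted row, though that has not been run here.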

Next we'll check to see if there are duplicate entries in the Android dataset and delete those rows.

In [8]:
duplicate_apps = android[android.duplicated(subset=['App'], keep=False)]
duplicate_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
23,Mcqueen Coloring pages,ART_AND_DESIGN,,61,7.0M,"100,000+",Free,0,Everyone,Art & Design;Action & Adventure,"March 7, 2018",1.0.0,4.1 and up
36,UNICORN - Color By Number & Pixel Art Coloring,ART_AND_DESIGN,4.7,8145,24M,"500,000+",Free,0,Everyone,Art & Design;Creativity,"August 2, 2018",1.0.9,4.4 and up
42,Textgram - write on photos,ART_AND_DESIGN,4.4,295221,Varies with device,"10,000,000+",Free,0,Everyone,Art & Design,"July 30, 2018",Varies with device,Varies with device
139,Wattpad 📖 Free Books,BOOKS_AND_REFERENCE,4.6,2914724,Varies with device,"100,000,000+",Free,0,Teen,Books & Reference,"August 1, 2018",Varies with device,Varies with device


In [9]:
duplicate_apps[duplicate_apps['App'] == 'Slack']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
240,Slack,BUSINESS,4.4,51507,Varies with device,"5,000,000+",Free,0,Everyone,Business,"August 2, 2018",Varies with device,Varies with device
269,Slack,BUSINESS,4.4,51507,Varies with device,"5,000,000+",Free,0,Everyone,Business,"August 2, 2018",Varies with device,Varies with device
294,Slack,BUSINESS,4.4,51510,Varies with device,"5,000,000+",Free,0,Everyone,Business,"August 2, 2018",Varies with device,Varies with device


In this example, the app "Slack" appears on the list three times.  All the data is the same except for the number of reviews.  A higher number of reviews suggests that the row is the most recent one on the list, so we would prefer to keep it.  We will go through all the duplicates and identify the one with the highest number of reviews to retain for our analysis.

In [10]:
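# Reviews was read as strings, so cast to int to sort numerically rather than lexicographically
duplicate_apps = duplicate_apps.astype({'Reviews': int}).sort_values(by=['App', 'Reviews'])
duplicate_apps.head()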

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1393,10 Best Foods for You,HEALTH_AND_FITNESS,4.0,2490,3.8M,"500,000+",Free,0,Everyone 10+,Health & Fitness,"February 17, 2017",1.9,2.3.3 and up
1407,10 Best Foods for You,HEALTH_AND_FITNESS,4.0,2490,3.8M,"500,000+",Free,0,Everyone 10+,Health & Fitness,"February 17, 2017",1.9,2.3.3 and up
2322,1800 Contacts - Lens Store,MEDICAL,4.7,23160,26M,"1,000,000+",Free,0,Everyone,Medical,"July 27, 2018",7.4.1,5.0 and up
2543,1800 Contacts - Lens Store,MEDICAL,4.7,23160,26M,"1,000,000+",Free,0,Everyone,Medical,"July 27, 2018",7.4.1,5.0 and up
2256,2017 EMRA Antibiotic Guide,MEDICAL,4.4,12,3.8M,"1,000+",Paid,$16.99,Everyone,Medical,"January 27, 2017",1.0.5,4.0.3 and up


In [11]:
duplicate_apps = duplicate_apps[duplicate_apps.duplicated(subset='App', keep='last')]
duplicate_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1393,10 Best Foods for You,HEALTH_AND_FITNESS,4.0,2490,3.8M,"500,000+",Free,0,Everyone 10+,Health & Fitness,"February 17, 2017",1.9,2.3.3 and up
2322,1800 Contacts - Lens Store,MEDICAL,4.7,23160,26M,"1,000,000+",Free,0,Everyone,Medical,"July 27, 2018",7.4.1,5.0 and up
2256,2017 EMRA Antibiotic Guide,MEDICAL,4.4,12,3.8M,"1,000+",Paid,$16.99,Everyone,Medical,"January 27, 2017",1.0.5,4.0.3 and up
1337,21-Day Meditation Experience,HEALTH_AND_FITNESS,4.4,11506,15M,"100,000+",Free,0,Everyone,Health & Fitness,"August 2, 2018",3.0.0,4.1 and up
5415,365Scores - Live Scores,SPORTS,4.6,666246,25M,"10,000,000+",Free,0,Everyone,Sports,"July 29, 2018",5.5.9,4.1 and up


In [12]:
android_clean = android.drop(duplicate_apps.index.values)
android_clean.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up


We created a dataframe with the duplicates and used it to delete the extra rows from our Android dataset, now called `android_clean`.
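The same idea, keeping the row with the most reviews for each app, can be sketched in one chained step. This is a minimal illustration on made-up data (the `toy` frame and `deduped` name are assumptions), not a replacement for the cells above.

```python
import pandas as pd

# Toy data: three copies of one app plus one app without duplicates
toy = pd.DataFrame({
    'App': ['Slack', 'Slack', 'Slack', 'Solo'],
    'Reviews': [51507, 51510, 51507, 12],
})

# Sort so the highest review count comes last within each app,
# keep only that last row, then restore the original row order
deduped = (
    toy.sort_values(by=['App', 'Reviews'])
       .drop_duplicates(subset='App', keep='last')
       .sort_index()
)
print(deduped)
```

Note that this only works as intended when `Reviews` is numeric; on string values the sort would be alphabetical.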

# Checking for English-language apps

We are interested in creating English-language apps, so we would like to exclude apps in other languages from our analysis.  Since there is no language column in either dataset, we will create a function to scan for non-English characters.  We'll consider an app to be a non-English app if its name contains more than 3 non-ASCII characters.

In [13]:
def is_english(string):
    notallowed = 0
    for character in string:
        if ord(character) > 127:
            notallowed += 1
            if notallowed > 3:
                return False
    return True

# testing function
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


The filter is not perfect, but it will allow us to screen out the obvious app names without sacrificing too much of the data.  We will use the `is_english()` function on both the Android and iOS datasets.
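To make the limits of this heuristic concrete, the sketch below counts non-ASCII characters for a few made-up app names (the names and the `count_non_ascii` helper are hypothetical). It suggests two kinds of misses: names in other Latin-script languages pass the filter, and English names with more than three emoji are rejected.

```python
def count_non_ascii(name):
    """Count characters outside the 0-127 ASCII range."""
    return sum(1 for character in name if ord(character) > 127)

# Hypothetical app names chosen to probe the 3-character threshold
examples = [
    'Guten Morgen Wecker',  # German, but pure ASCII, so it passes
    'Instachat 😜',          # English with one emoji, passes
    'Fun 😜😜😜😜',            # English with four emoji, rejected
]

for name in examples:
    n = count_non_ascii(name)
    print(name, n, n <= 3)
```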

In [14]:
android_eng = android_clean[android_clean['App'].apply(is_english)]
android_eng.shape

(9614, 13)

In [15]:
ios_eng = ios[ios['track_name'].apply(is_english)]
ios_eng.shape

(6183, 16)

# Isolating the free apps

Since we only build apps that are free to install and use, relying on advertising revenue, we want to isolate the free apps currently available on each platform for our analysis to determine what type(s) of apps we should focus on developing.

In [16]:
android_free = android_eng[android_eng['Price'] == '0'].copy()
ios_free = ios_eng[ios_eng['price'] == 0].copy()
print('Number of free, English-only Android apps:', android_free.shape[0])
print('Number of free, English-only iOS apps:', ios_free.shape[0])

Number of free, English-only Android apps: 8862
Number of free, English-only iOS apps: 3222
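The Android filter above compares `Price` to the string `'0'`, which works because the column was read as text. A slightly more defensive variant, sketched here on made-up values (the `prices` series is an assumption), strips the dollar sign and compares numerically, so that strings such as `'0.0'` or `'$0.00'` would also count as free.

```python
import pandas as pd

# Toy price strings in the style of the Play Store column
prices = pd.Series(['0', '$16.99', '$0.99', '$0.00'])

# Remove the leading dollar sign and convert to float
numeric_prices = prices.str.lstrip('$').astype(float)

# Boolean mask marking the free apps
is_free = numeric_prices == 0
print(is_free.tolist())  # [True, False, False, True]
```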


# What types of apps should we focus on?

The aim of our analysis is to determine what kinds of apps are likely to attract more users, since our ad revenue is influenced by the number of people using our apps.

To minimize risks and overhead, the strategy for an app idea is as follows:

* Build a minimal Android app and add it to Google Play.
* If the app has a good response from users, we develop it further.
* If the app is profitable after 6 months, we build an iOS version of the app and add it to the App Store.

The ultimate goal is to add apps on _both_ Google Play and the App Store, so we need to find the types of apps that will be successful in both markets. To do this, we will create frequency tables for the `prime_genre` column in the App Store data and the `Genres` and `Category` columns of the Play Store data.

In [17]:
android_free['Genres'].value_counts(normalize=True)*100

Tools                                  8.440533
Entertainment                          6.070864
Education                              5.348680
Business                               4.592643
Productivity                           3.893026
Lifestyle                              3.893026
Finance                                3.701196
Medical                                3.520650
Sports                                 3.464229
Personalization                        3.317536
Communication                          3.238547
Action                                 3.103137
Health & Fitness                       3.080569
Photography                            2.945159
News & Magazines                       2.798465
Social                                 2.663056
Travel & Local                         2.324532
Shopping                               2.245543
Books & Reference                      2.143986
Simulation                             2.042428
Dating                                 1

In [18]:
android_free['Category'].value_counts(normalize=True)*100

FAMILY                 18.968630
GAME                    9.681787
TOOLS                   8.451817
BUSINESS                4.592643
LIFESTYLE               3.904311
PRODUCTIVITY            3.893026
FINANCE                 3.701196
MEDICAL                 3.520650
SPORTS                  3.396524
PERSONALIZATION         3.317536
COMMUNICATION           3.238547
HEALTH_AND_FITNESS      3.080569
PHOTOGRAPHY             2.945159
NEWS_AND_MAGAZINES      2.798465
SOCIAL                  2.663056
TRAVEL_AND_LOCAL        2.335816
SHOPPING                2.245543
BOOKS_AND_REFERENCE     2.143986
DATING                  1.861882
VIDEO_PLAYERS           1.794177
MAPS_AND_NAVIGATION     1.399233
FOOD_AND_DRINK          1.241255
EDUCATION               1.162266
ENTERTAINMENT           0.947867
LIBRARIES_AND_DEMO      0.936583
AUTO_AND_VEHICLES       0.925299
HOUSE_AND_HOME          0.823742
WEATHER                 0.801174
EVENTS                  0.710900
PARENTING               0.654480
ART_AND_DE

In [19]:
ios_free['prime_genre'].value_counts(normalize=True)*100

Games                58.162632
Entertainment         7.883302
Photo & Video         4.965860
Education             3.662322
Social Networking     3.289882
Shopping              2.607076
Utilities             2.513966
Sports                2.141527
Music                 2.048417
Health & Fitness      2.017381
Productivity          1.738051
Lifestyle             1.582868
News                  1.334575
Travel                1.241465
Finance               1.117318
Weather               0.869025
Food & Drink          0.806952
Reference             0.558659
Business              0.527623
Book                  0.434513
Navigation            0.186220
Medical               0.186220
Catalogs              0.124146
Name: prime_genre, dtype: float64

Among the free, English-language iOS apps, Games (58%) and Entertainment (8%) together make up about two-thirds of the apps.  There are significantly more apps geared toward fun than toward productivity.

Looking at the Android apps, the `Genres` column is much more granular than the `Category` column, and the two overlap because some apps span more than one category.  The most common genres are Tools and Entertainment (8% and 6%, respectively).  The most common categories are Family (19%) and Games (10%).  Since the `Category` column is more general, we'll focus on that column from now on for the Android dataset.
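Part of the granularity in `Genres` comes from entries that combine two genres with a semicolon, such as `Art & Design;Pretend Play`. As a rough sketch of how such entries could be counted per individual genre (the small `genres` series below reuses values seen in the earlier output, and `split_counts` is a name introduced here):

```python
import pandas as pd

# A few Genres values in the same style as the Play Store data
genres = pd.Series([
    'Art & Design',
    'Art & Design;Pretend Play',
    'Art & Design;Creativity',
    'Tools',
])

# Split combined entries on ';' and count each genre separately
split_counts = genres.str.split(';').explode().value_counts()
print(split_counts)
```

An app with two genres is counted once under each, so the counts add up to more than the number of apps.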

## Looking at number of users per category

However, knowing which genres and categories contain the largest number of apps doesn't tell us which categories have the largest number of users.  It is possible that a large number of users rely on a small number of apps in other categories.  It would be helpful to examine the number of users or installs.

For the Android dataset, we can use the `Installs` column to get a general number for each category.  The iOS data does not have information on the number of installs, but we can use the `rating_count_tot` column, which contains the total number of user ratings for each app, as a rough proxy for the average number of users.

In [20]:
import numpy as np
ios_free.groupby('prime_genre')['rating_count_tot'].agg(np.mean).sort_values(ascending=False)

prime_genre
Navigation           86090.333333
Reference            74942.111111
Social Networking    71548.349057
Music                57326.530303
Weather              52279.892857
Book                 39758.500000
Food & Drink         33333.923077
Finance              31467.944444
Photo & Video        28441.543750
Travel               28243.800000
Shopping             26919.690476
Health & Fitness     23298.015385
Sports               23008.898551
Games                22788.669691
News                 21248.023256
Productivity         21028.410714
Utilities            18684.456790
Lifestyle            16485.764706
Entertainment        14029.830709
Business              7491.117647
Education             7003.983051
Catalogs              4004.000000
Medical                612.000000
Name: rating_count_tot, dtype: float64

The top 5 genres with the highest average number of ratings per app, our proxy for users, are Navigation, Reference, Social Networking, Music, and Weather.  Based on these results, it seems that an app in one of these practical categories, rather than Games, would best fit our revenue model.

The `Installs` column in the Android dataset contains open-ended values (100+, 500+, etc) instead of exact numbers.  We don't need precise data to determine the categories with the highest average users, so we'll leave the numbers as is and remove the non-numerical characters for calculations.

In [21]:
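# regex=False treats '+' as a literal character rather than a regex quantifier
android_free['Installs'] = android_free['Installs'].str.replace('+', '', regex=False).str.replace(',', '', regex=False).astype('int')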
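As a small worked example of this clean-up, the sketch below applies the same replacements to a made-up series of install strings (the `installs` values are illustrative). Each open-ended value such as `10,000+` becomes its lower bound as an integer.

```python
import pandas as pd

# Toy install strings in the same format as the Installs column
installs = pd.Series(['10,000+', '500,000+', '5,000,000+'])

# Strip the literal '+' and the thousands separators, then convert to int
cleaned = (
    installs.str.replace('+', '', regex=False)
            .str.replace(',', '', regex=False)
            .astype(int)
)
print(cleaned.tolist())  # [10000, 500000, 5000000]
```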

In [22]:
android_free.groupby('Category')['Installs'].agg(np.mean).sort_values(ascending=False).map('{:,.0f}'.format)

Category
COMMUNICATION          38,456,119
VIDEO_PLAYERS          24,727,872
SOCIAL                 23,253,652
PHOTOGRAPHY            17,805,628
PRODUCTIVITY           16,787,331
GAME                   15,567,447
TRAVEL_AND_LOCAL       13,984,078
ENTERTAINMENT          11,719,762
TOOLS                  10,682,301
NEWS_AND_MAGAZINES      9,549,178
BOOKS_AND_REFERENCE     8,767,812
SHOPPING                7,036,877
PERSONALIZATION         5,201,483
WEATHER                 5,074,486
HEALTH_AND_FITNESS      4,188,822
MAPS_AND_NAVIGATION     4,056,942
FAMILY                  3,697,201
SPORTS                  3,638,640
ART_AND_DESIGN          1,986,335
FOOD_AND_DRINK          1,924,898
EDUCATION               1,828,641
BUSINESS                1,712,290
LIFESTYLE               1,437,816
FINANCE                 1,387,692
HOUSE_AND_HOME          1,331,541
DATING                    854,029
COMICS                    817,657
AUTO_AND_VEHICLES         647,318
LIBRARIES_AND_DEMO        638,504
PAREN

For the Google Play dataset, the top 5 categories with the highest average number of installs are Communication, Video Players, Social, Photography, and Productivity.

# Recommendations

Based on this information, along with that from the App Store, our recommendation is to develop an app in the social media category (Social Networking on the App Store, SOCIAL on Google Play).  This category was in the top 5 on both platforms, giving us the broadest reach for our own app.
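The overlap between the two top-5 lists can be checked with a short sketch. The lists are copied from the results above; the `ios_to_android` mapping between App Store genres and Play Store categories is an assumption made for illustration, since the two stores do not share a naming scheme.

```python
# Top-5 lists copied from the two tables above
ios_top5 = ['Navigation', 'Reference', 'Social Networking', 'Music', 'Weather']
android_top5 = ['COMMUNICATION', 'VIDEO_PLAYERS', 'SOCIAL', 'PHOTOGRAPHY', 'PRODUCTIVITY']

# Hypothetical mapping from App Store genres to Play Store categories
ios_to_android = {
    'Navigation': 'MAPS_AND_NAVIGATION',
    'Reference': 'BOOKS_AND_REFERENCE',
    'Social Networking': 'SOCIAL',
    'Music': None,  # no counterpart appears in the output above, so left unmapped
    'Weather': 'WEATHER',
}

# iOS genres whose mapped Android category is also in the Android top 5
shared = [genre for genre in ios_top5 if ios_to_android.get(genre) in android_top5]
print(shared)  # ['Social Networking']
```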