# Profitable Apps Dataset
A data analysis project for a company that builds Android and iOS mobile apps, available on Google Play and the App Store. The company only builds apps that are free to download and install, and its main source of revenue is in-app ads.

This means that an app's revenue depends largely on its number of users: the more users who see and engage with the ads, the better. The goal of this project is to analyze data to help the developers understand what types of apps are likely to attract more users.

This project includes several parts:
1. Opening & Exploring the Dataset
2. Cleaning the Dataset
 - Deleting Incorrect Data
 - Checking for Null Values
 - Deleting Duplicate Values
 - Removing Non-English Apps
 - Removing Non-Free Apps
3. Finding a Profitable App Profile
 - Most Common Apps by Genre
 - Most Popular Apps by Genre

In [1]:
import numpy as np
import pandas as pd

## Opening & Exploring the Dataset

Our dataset contains a sample of app data. **apple_data** contains data on roughly 7,200 apps (7,197 rows), and **google_data** contains 10,841 rows covering roughly 9,660 distinct apps.

App Store data available from: https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps

Google Play Store data available from: https://www.kaggle.com/datasets/lava18/google-play-store-apps

In [2]:
apple_data = pd.read_csv('AppleStore.csv')
google_data = pd.read_csv('googleplaystore.csv')

In [3]:
apple_data

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7192,1170406182,Shark Boom - Challenge Friends with your Pet,245415936,USD,0.0,0,0,0.0,0.0,1.0.9,4+,Games,38,5,1,1
7193,1069830936,【謎解き】ヤミすぎ彼女からのメッセージ,16808960,USD,0.0,0,0,0.0,0.0,1.2,9+,Book,38,0,1,1
7194,1070052833,Go!Go!Cat!,91468800,USD,0.0,0,0,0.0,0.0,1.1.2,12+,Games,37,2,2,1
7195,1081295232,Suppin Detective: Expose their true visage!,83026944,USD,0.0,0,0,0.0,0.0,1.0.3,12+,Entertainment,40,0,1,1


In [4]:
google_data

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


Note that the Apple dataset has **16** columns and the Google dataset has **13** columns.

In [5]:
# Apple column values:
for col in apple_data.columns:
    print(col)

id
track_name
size_bytes
currency
price
rating_count_tot
rating_count_ver
user_rating
user_rating_ver
ver
cont_rating
prime_genre
sup_devices.num
ipadSc_urls.num
lang.num
vpp_lic


In [6]:
# Google column values:
for col in google_data.columns:
    print(col)

App
Category
Rating
Reviews
Size
Installs
Type
Price
Content Rating
Genres
Last Updated
Current Ver
Android Ver


## Cleaning the Dataset
Before we begin analyzing the data, we want to ensure that the values are correct and that missing or null values are handled.

### Deleting Incorrect Data
The Google Play Store dataset has a discussion section, which outlines an error in the dataset in row 10472. 

In [7]:
# Correct row:
print(google_data.loc[1])
print()
# Row with error:
print(google_data.loc[10472])

App                     Coloring book moana
Category                     ART_AND_DESIGN
Rating                                  3.9
Reviews                                 967
Size                                    14M
Installs                           500,000+
Type                                   Free
Price                                     0
Content Rating                     Everyone
Genres            Art & Design;Pretend Play
Last Updated               January 15, 2018
Current Ver                           2.0.0
Android Ver                    4.0.3 and up
Name: 1, dtype: object

App               Life Made WI-Fi Touchscreen Photo Frame
Category                                              1.9
Rating                                                 19
Reviews                                              3.0M
Size                                               1,000+
Installs                                             Free
Type                                                    

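The row is clearly malformed: the Category holds 1.9, the rating of 19 is out of range (ratings run from 1 to 5), and the remaining values appear shifted one column to the left, which leaves the Content Rating missing and the Price incorrect.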

We can delete this row and check for other null/missing values in the dataset.

In [8]:
print(len(google_data))
google_data.drop(10472, inplace=True)
print(len(google_data))

10841
10840
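Beyond the single row flagged in the dataset discussion, a simple range check can surface similar records. The following is a minimal sketch on a toy frame (the `sample` data and the `bad_index` name are illustrative, not part of the notebook), assuming that valid ratings lie between 1 and 5:

```python
import pandas as pd

# Toy stand-in for google_data; the last row imitates the malformed record.
sample = pd.DataFrame(
    {
        "App": [
            "Coloring book moana",
            "Parkinson Exercices FR",
            "Life Made WI-Fi Touchscreen Photo Frame",
        ],
        "Rating": [3.9, float("nan"), 19.0],
    }
)

# Flag ratings outside the 1-5 range. Comparisons with NaN are False,
# so unrated apps are not flagged by this check.
out_of_range = sample[(sample["Rating"] < 1) | (sample["Rating"] > 5)]
bad_index = out_of_range.index.tolist()
print(out_of_range)
```

Any index labels collected this way could then be inspected before deciding whether to drop them.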


### Checking for Null Values
Let's check the total number of null (empty) values for each dataset:

In [9]:
apple_data.isnull().sum()

id                  0
track_name          0
size_bytes          0
currency            0
price               0
rating_count_tot    0
rating_count_ver    0
user_rating         0
user_rating_ver     0
ver                 0
cont_rating         0
prime_genre         0
sup_devices.num     0
ipadSc_urls.num     0
lang.num            0
vpp_lic             0
dtype: int64

There are no null values in apple_data. Now let's check google_data:

In [10]:
google_data.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       0
Genres               0
Last Updated         0
Current Ver          8
Android Ver          2
dtype: int64

1,474 apps have no value in the Rating column, 1 has no value in Type, 8 have no value in Current Ver, and 2 have no value in Android Ver.

Which rows should be removed? We will leave the apps with null values in columns we aren't concerned with (Current Ver and Android Ver).

##### Null Values in Type

In [11]:
google_data['Type'].value_counts(dropna=False)

Free    10039
Paid      800
NaN         1
Name: Type, dtype: int64

We are only interested in free apps, so an app with no value in the Type column is a problem for our analysis. Let's look into the app with the null value:

In [12]:
google_data[google_data['Type'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
9148,Command & Conquer: Rivals,FAMILY,,0,Varies with device,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device,Varies with device


The app has a NaN value under Type but a price of 0 -- instead of deleting it we can update it to 'Free':

In [13]:
google_data.loc[9148, 'Type'] = 'Free'

In [14]:
google_data['Type'].value_counts(dropna=False)

Free    10040
Paid      800
Name: Type, dtype: int64
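The same fix can be written as a rule instead of a hard-coded index label. This is a sketch on toy data, assuming (as the displayed rows suggest) that Price is stored as text such as `'0'` or `'$3.99'`; the `sample` frame and `mask` name are illustrative:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {
        "App": ["Command & Conquer: Rivals", "EG | Explore Folegandros", "ROBLOX"],
        "Type": [np.nan, "Paid", "Free"],
        "Price": ["0", "$3.99", "0"],
    }
)

# Fill Type only where it is missing AND the price string is exactly "0".
mask = sample["Type"].isnull() & (sample["Price"] == "0")
sample.loc[mask, "Type"] = "Free"
print(sample)
```

Rows with a missing Type and a non-zero price would be left untouched for manual review.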

##### Null values in Rating

In [15]:
google_data.Rating.unique()

array([4.1, 3.9, 4.7, 4.5, 4.3, 4.4, 3.8, 4.2, 4.6, 3.2, 4. , nan, 4.8,
       4.9, 3.6, 3.7, 3.3, 3.4, 3.5, 3.1, 5. , 2.6, 3. , 1.9, 2.5, 2.8,
       2.7, 1. , 2.9, 2.3, 2.2, 1.7, 2. , 1.8, 2.4, 1.6, 2.1, 1.4, 1.5,
       1.2])

We might not need rating data for our purposes, but we don't want to leave it empty. Instead, we can change the null values to 'Unrated':

In [16]:
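# Assign the result back rather than filling in place on a column selection,
# which newer pandas versions may not propagate to the dataframe.
google_data['Rating'] = google_data['Rating'].fillna("Unrated")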

In [17]:
google_data.Rating.unique()

array([4.1, 3.9, 4.7, 4.5, 4.3, 4.4, 3.8, 4.2, 4.6, 3.2, 4.0, 'Unrated',
       4.8, 4.9, 3.6, 3.7, 3.3, 3.4, 3.5, 3.1, 5.0, 2.6, 3.0, 1.9, 2.5,
       2.8, 2.7, 1.0, 2.9, 2.3, 2.2, 1.7, 2.0, 1.8, 2.4, 1.6, 2.1, 1.4,
       1.5, 1.2], dtype=object)
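One consequence worth keeping in mind: mixing the 'Unrated' label with numbers turns Rating into a non-numeric column, as the `dtype=object` in the output shows. The sketch below uses a toy Series (not the real data) to show one way numeric work could still be done later, by coercing the labels back to NaN:

```python
import numpy as np
import pandas as pd

ratings = pd.Series([4.1, np.nan, 3.9], name="Rating")

# Filling with a string label means the column no longer holds only floats.
filled = ratings.fillna("Unrated")
print(filled.tolist())

# For numeric work, convert the labels back to NaN first.
numeric = pd.to_numeric(filled, errors="coerce")
mean_rating = numeric.mean()  # NaN values are skipped by default
print(mean_rating)
```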

### Deleting Duplicate Values
We want to ensure that the data in our dataset is not only correct and non-empty, but also that there are no duplicate entries.

Let's look at the title of apps to see if an app is listed multiple times:

In [18]:
unique_apple_apps = apple_data['track_name'].nunique()
unique_google_apps = google_data['App'].nunique()

print("Unique Apple apps:", unique_apple_apps)
print("Total Apple apps:", len(apple_data))
print("Duplicate Apple apps:", len(apple_data) - unique_apple_apps)
print()
print("Unique Google apps:", unique_google_apps)
print("Total Google apps:", len(google_data))
print("Duplicate Google apps:", len(google_data) - unique_google_apps)

Unique Apple apps: 7195
Total Apple apps: 7197
Duplicate Apple apps: 2

Unique Google apps: 9659
Total Google apps: 10840
Duplicate Google apps: 1181


There are 2 duplicate iOS apps and 1181 duplicate Android apps. Let's start with the apple_data set:

In [19]:
apple_data.track_name.value_counts()[:2]

Mannequin Challenge    2
VR Roller Coaster      2
Name: track_name, dtype: int64

We can see from the above value counts of the track_name column that 'Mannequin Challenge' and 'VR Roller Coaster' are listed twice. Let's look at each one:

In [20]:
apple_data.loc[apple_data['track_name'] == 'Mannequin Challenge']

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
2948,1173990889,Mannequin Challenge,109705216,USD,0.0,668,87,3.0,3.0,1.4,9+,Games,37,4,1,1
4463,1178454060,Mannequin Challenge,59572224,USD,0.0,105,58,4.0,4.5,1.0.1,4+,Games,38,5,1,1


In [21]:
apple_data.loc[apple_data['track_name'] == 'VR Roller Coaster']

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
4442,952877179,VR Roller Coaster,169523200,USD,0.0,107,102,3.5,3.5,2.0.0,4+,Games,37,5,1,1
4831,1089824278,VR Roller Coaster,240964608,USD,0.0,67,44,3.5,4.0,0.81,4+,Games,38,0,1,1


Each app differs from its namesake in size_bytes, ratings, and, in the case of Mannequin Challenge, cont_rating, so they appear to be different apps that share a name. We can leave these apps in the dataset.

Now let's check the Google dataset:

In [22]:
google_data.App.value_counts()[:20]

ROBLOX                                                9
CBS Sports App - Scores, News, Stats & Watch Live     8
ESPN                                                  7
Duolingo: Learn Languages Free                        7
8 Ball Pool                                           7
Candy Crush Saga                                      7
Sniper 3D Gun Shooter: Free Shooting Games - FPS      6
Temple Run 2                                          6
Nick                                                  6
Subway Surfers                                        6
Bowmasters                                            6
Helix Jump                                            6
slither.io                                            6
Bleacher Report: sports news, scores, & highlights    6
Bubble Shooter                                        6
Zombie Catchers                                       6
theScore: Live Sports Scores, News, Stats & Videos    5
TripAdvisor Hotels Flights Restaurants Attractio

Many apps are listed more than once, and a number of them appear in 6 or more rows. Let's look at the most duplicated app, ROBLOX:

In [23]:
google_data.loc[google_data['App'] == 'ROBLOX']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1653,ROBLOX,GAME,4.5,4447388,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
1701,ROBLOX,GAME,4.5,4447346,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
1748,ROBLOX,GAME,4.5,4448791,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
1841,ROBLOX,GAME,4.5,4449882,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
1870,ROBLOX,GAME,4.5,4449910,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
2016,ROBLOX,FAMILY,4.5,4449910,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
2088,ROBLOX,FAMILY,4.5,4450855,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
2206,ROBLOX,FAMILY,4.5,4450890,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up
4527,ROBLOX,FAMILY,4.5,4443407,67M,"100,000,000+",Free,0,Everyone 10+,Adventure;Action & Adventure,"July 31, 2018",2.347.225742,4.1 and up


The values in each column look fairly similar, meaning they are likely duplicates. Let's check if the other apps are also duplicates:

In [24]:
google_data.loc[google_data['App'] == 'CBS Sports App - Scores, News, Stats & Watch Live']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2976,"CBS Sports App - Scores, News, Stats & Watch Live",SPORTS,4.3,91031,Varies with device,"5,000,000+",Free,0,Everyone,Sports,"August 4, 2018",Varies with device,5.0 and up
3007,"CBS Sports App - Scores, News, Stats & Watch Live",SPORTS,4.3,91031,Varies with device,"5,000,000+",Free,0,Everyone,Sports,"August 4, 2018",Varies with device,5.0 and up
3015,"CBS Sports App - Scores, News, Stats & Watch Live",SPORTS,4.3,91031,Varies with device,"5,000,000+",Free,0,Everyone,Sports,"August 4, 2018",Varies with device,5.0 and up
3020,"CBS Sports App - Scores, News, Stats & Watch Live",SPORTS,4.3,91031,Varies with device,"5,000,000+",Free,0,Everyone,Sports,"August 4, 2018",Varies with device,5.0 and up
3056,"CBS Sports App - Scores, News, Stats & Watch Live",SPORTS,4.3,91033,Varies with device,"5,000,000+",Free,0,Everyone,Sports,"August 4, 2018",Varies with device,5.0 and up
3064,"CBS Sports App - Scores, News, Stats & Watch Live",SPORTS,4.3,91033,Varies with device,"5,000,000+",Free,0,Everyone,Sports,"August 4, 2018",Varies with device,5.0 and up
3090,"CBS Sports App - Scores, News, Stats & Watch Live",SPORTS,4.3,91033,Varies with device,"5,000,000+",Free,0,Everyone,Sports,"August 4, 2018",Varies with device,5.0 and up
9594,"CBS Sports App - Scores, News, Stats & Watch Live",SPORTS,4.3,91035,Varies with device,"5,000,000+",Free,0,Everyone,Sports,"August 4, 2018",Varies with device,5.0 and up


In [25]:
google_data.loc[google_data['App'] == '8 Ball Pool']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1675,8 Ball Pool,GAME,4.5,14198297,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up
1703,8 Ball Pool,GAME,4.5,14198602,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up
1755,8 Ball Pool,GAME,4.5,14200344,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up
1844,8 Ball Pool,GAME,4.5,14200550,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up
1871,8 Ball Pool,GAME,4.5,14201891,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up
1970,8 Ball Pool,GAME,4.5,14201604,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up
3953,8 Ball Pool,SPORTS,4.5,14184910,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up


They all look similar enough to classify as duplicates. Next, we'll remove these duplicate entries from the dataset.

#### Deciding which app to keep
We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you examine the rows printed above, the main difference is in the Reviews column, which holds the number of reviews. The different numbers suggest that the data was collected at different times. We can use this to build a criterion for keeping rows: rather than removing rows randomly, we'll keep the row with the highest number of reviews, because the higher the number of reviews, the more reliable the ratings.

In the above example, we want the 8 Ball Pool game with the following number of reviews:

In [26]:
google_data.loc[google_data['App'] == '8 Ball Pool'].Reviews.max()

'14201891'

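Note that the result is a string: Reviews is stored as text, so its values compare character by character rather than numerically. First, we'll convert the Reviews column to integers so that it sorts numerically: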

In [27]:
google_data['Reviews'] = google_data['Reviews'].astype(int)
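To illustrate why the conversion matters, here is a small sketch with three review counts taken from rows shown earlier. The Series name `reviews_as_text` is illustrative only:

```python
import pandas as pd

reviews_as_text = pd.Series(["967", "14201891", "87510"])

# Strings compare character by character, so "967" ranks above "14201891"
# because "9" sorts after "1".
text_max = reviews_as_text.max()

# After conversion, the comparison is numeric.
numeric_max = reviews_as_text.astype(int).max()

print(text_max, numeric_max)
```

Sorting the text column directly would therefore not guarantee that the last row per app has the most reviews.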

Now we can sort the dataframe by Reviews and call `drop_duplicates` on the App column. This drops all duplicate app names while keeping the last row for each, which is the one with the highest number of reviews:

In [28]:
google_data = google_data.sort_values('Reviews').drop_duplicates(subset=['App'], keep='last')

In [29]:
google_data

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
6665,BAR-B-Q Recipes,FOOD_AND_DRINK,Unrated,0,3.6M,100+,Free,0,Everyone,Food & Drink,"January 30, 2018",1.0,4.0 and up
7732,SHUTTLLS CQ - Connect Ride Go,TRAVEL_AND_LOCAL,Unrated,0,18M,5+,Free,0,Everyone,Travel & Local,"April 26, 2018",4.6.2100,4.3 and up
7735,CQ Ukraine,PRODUCTIVITY,Unrated,0,9.1M,10+,Free,0,Everyone,Productivity,"June 25, 2018",1.17.0,4.1 and up
9337,EG | Explore Folegandros,TRAVEL_AND_LOCAL,Unrated,0,56M,0+,Paid,$3.99,Everyone,Travel & Local,"January 22, 2017",1.1.1,4.1 and up
7737,CQ Electrical Group,PRODUCTIVITY,Unrated,0,14M,1+,Free,0,Everyone,Productivity,"July 19, 2018",1.0.0,4.1 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1879,Clash of Clans,GAME,4.6,44893888,98M,"100,000,000+",Free,0,Everyone 10+,Strategy,"July 15, 2018",10.322.16,4.1 and up
382,Messenger – Text and Video Chat for Free,COMMUNICATION,4,56646578,Varies with device,"1,000,000,000+",Free,0,Everyone,Communication,"August 1, 2018",Varies with device,Varies with device
2604,Instagram,SOCIAL,4.5,66577446,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"July 31, 2018",Varies with device,Varies with device
336,WhatsApp Messenger,COMMUNICATION,4.4,69119316,Varies with device,"1,000,000,000+",Free,0,Everyone,Communication,"August 3, 2018",Varies with device,Varies with device


Our google_data dataframe now has the expected 9659 rows with no duplicate values. Let's check the Review count for the duplicate app we listed above:

In [30]:
google_data.loc[google_data['App'] == '8 Ball Pool']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1871,8 Ball Pool,GAME,4.5,14201891,52M,"100,000,000+",Free,0,Everyone,Sports,"July 31, 2018",4.0.0,4.0.3 and up


There is only one app under that name, and with the highest number of reviews. We can revert the dataframe to be sorted by index: 

In [31]:
google_data.sort_index(inplace=True)
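The whole sort, drop, and re-sort sequence can be checked on a toy frame. This sketch reuses three of the 8 Ball Pool rows shown earlier plus one unrelated app; the `sample` and `deduped` names are illustrative:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "App": ["8 Ball Pool", "8 Ball Pool", "8 Ball Pool", "Sketch - Draw & Paint"],
        "Reviews": [14198297, 14201891, 14184910, 215644],
    },
    index=[1675, 1871, 3953, 3],
)

deduped = (
    sample.sort_values("Reviews")                  # ascending, highest count last
    .drop_duplicates(subset=["App"], keep="last")  # keep the highest count per app
    .sort_index()                                  # restore the original row order
)
print(deduped)
```

Only the row with index 1871 survives for 8 Ball Pool, matching the result in the notebook.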

### Removing Non-English Apps

If you explore the data sets enough, you'll notice the names of some of the apps suggest they are not directed toward an English-speaking audience. Below, we see a couple of examples from both data sets:

In [32]:
print(apple_data.loc[813].track_name)
print(apple_data.loc[6731].track_name)
print(google_data.loc[5513].App)
print(google_data.loc[9117].App)

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


For our data purposes, we're only going to look at Apps in English. So we can create a function to remove all non-English apps from our dataset.

In [33]:
def is_english(s):
    return s.isascii()

In [34]:
print(is_english(apple_data.loc[813].track_name))
print(is_english(apple_data.loc[6731].track_name))
print(is_english(google_data.loc[5513].App))
print(is_english(google_data.loc[9117].App))

False
False
False
False


The `.isascii()` method checks whether every character in a string is ASCII. This works well enough to remove apps whose names use non-Latin scripts, but it also removes English apps whose names contain non-ASCII characters:

In [35]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

False
False


Instead we can create a new function that will check app names for non-ascii characters and, if there are more than 3, will mark the app as non-English.

In [36]:
def is_english(s):
    non_ascii = 0
    for char in s:
        if ord(char) > 127:
            non_ascii += 1
        if non_ascii > 3:
            return False
    return True

In [37]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('中国語 AQリスニング'))

True
True
False


This function works well enough for our purposes, although it's possible a few non-English apps might make it past our filter, and some English apps will be removed. But we can work on optimizing this later.
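As a possible alternative to the character loop, the same rule can be expressed with pandas string methods. This is only a sketch of an equivalent approach, not the method used in this notebook; the `names` Series and the regular expression are assumptions introduced for illustration:

```python
import pandas as pd

names = pd.Series(
    ["Docs To Go™ Free Office Suite", "Instachat 😜", "中国語 AQリスニング"]
)

# Count the characters outside the ASCII range (code points above 127).
non_ascii_counts = names.str.count(r"[^\x00-\x7F]")

# Same rule as is_english(): allow at most 3 non-ASCII characters.
english_mask = non_ascii_counts <= 3
print(english_mask.tolist())
```

A boolean mask built this way could be used to index the dataframe directly.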

Below, we use the `is_english()` function to filter out the non-English apps for both data sets using the `map` function:

In [38]:
google_data = google_data[google_data['App'].map(is_english)]

In [39]:
apple_data = apple_data[apple_data['track_name'].map(is_english)]

In [40]:
print("Google data rows: ", google_data.shape[0])
print("Apple data rows: ", apple_data.shape[0])

Google data rows:  9614
Apple data rows:  6183


We're now left with 9,614 Google apps and 6,183 Apple apps.

### Removing Non-Free Apps

Because the company only produces free-to-use apps, we are going to isolate our dataset to only include apps which are free to download.

We can do this by keeping all rows in which the 'Type' column is equal to 'Free' (for google_data) and 'price' is equal to 0.0 (for apple_data).

In [41]:
google_data = google_data[google_data['Type'] == 'Free']
apple_data = apple_data[apple_data['price'] == 0.0]

In [42]:
print("Google data rows: ", google_data.shape[0])
print("Apple data rows: ", apple_data.shape[0])

Google data rows:  8864
Apple data rows:  3222


## Finding a Profitable App Profile

### Most Common Apps By Genre

As mentioned in the introduction, the aim is to determine the kinds of apps that are likely to attract more users. Because the end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, let's create a frequency table for each value in the 'prime_genre' column (for apple_data), and the 'Genres' and 'Category' columns (for google_data).

In [43]:
apple_data['prime_genre'].value_counts() / apple_data.shape[0] * 100

Games                58.162632
Entertainment         7.883302
Photo & Video         4.965860
Education             3.662322
Social Networking     3.289882
Shopping              2.607076
Utilities             2.513966
Sports                2.141527
Music                 2.048417
Health & Fitness      2.017381
Productivity          1.738051
Lifestyle             1.582868
News                  1.334575
Travel                1.241465
Finance               1.117318
Weather               0.869025
Food & Drink          0.806952
Reference             0.558659
Business              0.527623
Book                  0.434513
Medical               0.186220
Navigation            0.186220
Catalogs              0.124146
Name: prime_genre, dtype: float64
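The percentage table above divides the counts by the number of rows. pandas also offers `value_counts(normalize=True)`, which returns proportions directly. A small sketch on a toy Series (not the real data) shows that the two forms agree:

```python
import pandas as pd

genres = pd.Series(["Games", "Games", "Games", "Music"], name="prime_genre")

# Manual form, as used in the notebook.
manual = genres.value_counts() / genres.shape[0] * 100

# Built-in form: normalize=True returns proportions that sum to 1.
built_in = genres.value_counts(normalize=True) * 100

print(manual)
print(built_in)
```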

#### Insights into apple_data

We can see from the frequency table that more than half of all free English apps are Games, at 58%. The three most common genres in the App Store are all designed for fun (Games, Entertainment, and Photo & Video), and together they comprise roughly 71% of all apps. Practical apps, such as those in Education, Utilities, Productivity, and Lifestyle, are rarer.

However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users; the demand might not match the offer.

Let's look into google_data next:

In [44]:
google_data['Category'].value_counts() / google_data.shape[0] * 100

FAMILY                 18.930505
GAME                    9.713448
TOOLS                   8.461191
BUSINESS                4.591606
LIFESTYLE               3.903430
PRODUCTIVITY            3.892148
FINANCE                 3.700361
MEDICAL                 3.531137
SPORTS                  3.395758
PERSONALIZATION         3.316787
COMMUNICATION           3.237816
HEALTH_AND_FITNESS      3.079874
PHOTOGRAPHY             2.944495
NEWS_AND_MAGAZINES      2.797834
SOCIAL                  2.662455
TRAVEL_AND_LOCAL        2.335289
SHOPPING                2.245036
BOOKS_AND_REFERENCE     2.143502
DATING                  1.861462
VIDEO_PLAYERS           1.793773
MAPS_AND_NAVIGATION     1.398917
FOOD_AND_DRINK          1.240975
EDUCATION               1.162004
ENTERTAINMENT           0.947653
LIBRARIES_AND_DEMO      0.936372
AUTO_AND_VEHICLES       0.925090
HOUSE_AND_HOME          0.823556
WEATHER                 0.800993
EVENTS                  0.710740
PARENTING               0.654332
ART_AND_DE

In [45]:
google_data['Genres'].value_counts() / google_data.shape[0] * 100

Tools                          8.449910
Entertainment                  6.069495
Education                      5.347473
Business                       4.591606
Productivity                   3.892148
                                 ...   
Strategy;Action & Adventure    0.011282
Racing;Pretend Play            0.011282
Books & Reference;Education    0.011282
Art & Design;Pretend Play      0.011282
Puzzle;Education               0.011282
Name: Genres, Length: 114, dtype: float64

#### Insights into google_data

We can see from the results that the most common Categories are Family, Game, Tools and Business, whereas the most common Genres are Tools, Entertainment, Education, Business and Productivity. Let's combine both columns to find the most common apps by both Category and Genre:

In [46]:
google_data.groupby(['Category', 'Genres']).size().sort_values(ascending=False)

Category       Genres                               
TOOLS          Tools                                    749
FAMILY         Entertainment                            459
BUSINESS       Business                                 407
FAMILY         Education                                382
LIFESTYLE      Lifestyle                                345
                                                       ... 
FAMILY         Adventure;Education                        1
               Music & Audio;Music & Video                1
               Education;Brain Games                      1
VIDEO_PLAYERS  Video Players & Editors;Music & Video      1
FAMILY         Arcade;Pretend Play                        1
Length: 132, dtype: int64

The most common apps are in Tools & Tools, followed by Family & Entertainment, then Business & Business, Family & Education, and Lifestyle & Lifestyle.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea of the kinds of apps that have the most users.

### Most Popular Apps by Genre

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the 'rating_count_tot' column.
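On the Google Play side, the Installs values shown earlier are text such as `'10,000+'`, so they would need to be converted before averaging. The following is a minimal sketch on toy rows, assuming each value can be treated as its lower bound (`'10,000+'` becomes 10000); the `installs_num` column name is hypothetical:

```python
import pandas as pd

sample = pd.DataFrame(
    {
        "Category": ["ART_AND_DESIGN", "ART_AND_DESIGN", "GAME"],
        "Installs": ["10,000+", "500,000+", "100,000,000+"],
    }
)

# Strip the thousands separators and the trailing "+" before converting.
sample["installs_num"] = (
    sample["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(int)
)

# Average installs per category, highest first.
avg_installs = (
    sample.groupby("Category")["installs_num"].mean().sort_values(ascending=False)
)
print(avg_installs)
```

Because the install counts are open-ended buckets, averages computed this way are approximations rather than exact figures.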

#### App Store Analysis

Below, we calculate the average number of user ratings per app genre on the App Store:

In [47]:
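# numeric_only=True skips text columns such as track_name, which newer pandas
# versions refuse to average.
apple_data.groupby(['prime_genre']).mean(numeric_only=True).sort_values(['rating_count_tot'], ascending=False)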

prime_genre,id,size_bytes,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
Navigation,425013800.0,96481790.0,0.0,86090.333333,749.5,3.833333,2.25,37.166667,3.333333,20.166667,1.0
Reference,771769700.0,120077200.0,0.0,74942.111111,2746.555556,3.666667,3.861111,36.555556,3.5,10.277778,1.0
Social Networking,733306200.0,93665390.0,0.0,71548.349057,867.858491,3.59434,2.985849,36.216981,1.95283,12.566038,0.990566
Music,726257400.0,86460170.0,0.0,57326.530303,628.075758,3.94697,3.931818,36.5,3.742424,8.545455,1.0
Weather,632221600.0,89602890.0,0.0,52279.892857,1798.5,3.482143,3.017857,36.821429,3.678571,11.321429,1.0
Book,824671200.0,118487000.0,0.0,39758.5,485.428571,3.071429,3.142857,37.285714,3.071429,4.642857,1.0
Food & Drink,704776700.0,73952260.0,0.0,33333.923077,755.807692,3.634615,3.25,36.615385,1.230769,4.346154,1.0
Finance,584580700.0,94659440.0,0.0,31467.944444,697.222222,3.375,2.847222,35.472222,2.472222,1.805556,1.0
Photo & Video,820967000.0,84955280.0,0.0,28441.54375,435.7375,3.903125,3.384375,36.475,2.61875,11.5875,1.0
Travel,592205000.0,97872030.0,0.0,28243.8,244.075,3.4875,2.7375,37.325,2.375,10.2,0.975


On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which together have close to half a million user ratings:

In [48]:
apple_data.loc[apple_data['prime_genre'] == 'Navigation']

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
49,323229106,"Waze - GPS Navigation, Maps & Real-time Traffic",94139392,USD,0.0,345046,3040,4.5,4.5,4.24,4+,Navigation,37,5,36,1
130,585027354,Google Maps - Navigation & Transit,120232960,USD,0.0,154911,1253,4.5,4.0,4.31.1,12+,Navigation,37,5,34,1
881,329541503,Geocaching®,108166144,USD,0.0,12811,134,3.5,1.5,5.3,4+,Navigation,37,0,22,1
1633,504677517,CoPilot GPS – Car Navigation & Offline Maps,82534400,USD,0.0,3582,70,4.0,3.5,10.0.0.984,4+,Navigation,38,5,25,1
3987,344176018,ImmobilienScout24: Real Estate Search in Germany,126867456,USD,0.0,187,0,3.5,0.0,9.5,4+,Navigation,37,5,3,1
6033,463431091,Railway Route Search,46950400,USD,0.0,5,0,3.0,0.0,3.17.1,4+,Navigation,37,0,1,1


The same pattern applies to social networking apps, where the average number is heavily influenced by a few giants like Facebook, Pinterest, Skype, etc. Same applies to music apps, where a few big players like Pandora, Spotify, and Shazam heavily influence the average number.

Our aim is to find popular genres, but navigation, social networking, or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by a handful of apps with hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps from each genre and then recomputing the averages, which we'll get to later.
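
As a rough sketch of that idea, the cell below compares the plain per-genre mean with the median and with a mean computed after dropping very large apps. The `sample` DataFrame is a small illustrative stand-in that reuses the `prime_genre` and `rating_count_tot` column names, and `OUTLIER_THRESHOLD` is an assumed cutoff, not a value chosen anywhere in this analysis.

```python
import pandas as pd

# Illustrative stand-in for apple_data: same column names, only a few rows
sample = pd.DataFrame({
    "prime_genre": ["Navigation", "Navigation", "Navigation",
                    "Music", "Music", "Music"],
    "rating_count_tot": [345046, 154911, 12811, 1126879, 8000, 2000],
})

OUTLIER_THRESHOLD = 100_000  # assumed cutoff for "extremely popular" apps

by_genre = sample.groupby("prime_genre")["rating_count_tot"]

# Keep only apps below the cutoff before averaging again
trimmed = sample[sample["rating_count_tot"] < OUTLIER_THRESHOLD]

comparison = pd.DataFrame({
    "mean_all": by_genre.mean(),
    "median_all": by_genre.median(),
    "mean_trimmed": trimmed.groupby("prime_genre")["rating_count_tot"].mean(),
})
print(comparison)
```

The median needs no cutoff, which makes it a convenient first check; the trimmed mean depends on where the threshold is set, so it is worth trying more than one value.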

Reference apps have 74,942 user ratings on average, but it's actually the Bible and Dictionary.com that skew the average number of ratings upward:

In [49]:
apple_data.loc[apple_data['prime_genre'] == 'Reference'][:5]

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
6,282935706,Bible,92774400,USD,0.0,985920,5320,4.5,5.0,7.5.1,4+,Reference,37,5,45,1
90,308750436,Dictionary.com Dictionary & Thesaurus,111275008,USD,0.0,200047,177,4.0,4.0,7.1.3,4+,Reference,37,0,1,1
335,364740856,Dictionary.com Dictionary & Thesaurus for iPad,165748736,USD,0.0,54175,10176,4.5,4.5,4.0,4+,Reference,24,5,9,1
551,414706506,Google Translate,65281024,USD,0.0,26786,27,3.5,4.5,5.10.0,4+,Reference,37,5,59,1
715,388389451,"Muslim Pro: Ramadan 2017 Prayer Times, Azan, Q...",100551680,USD,0.0,18418,706,4.5,5.0,9.2.1,4+,Reference,37,5,16,1


#### Insights
However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.

This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.

Other genres that seem popular include weather, book, food and drink, and finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:

  - Weather apps: people generally don't spend much time in-app, so the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.

  - Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. Making a popular food and drink app therefore requires actual cooking and a delivery service, which is outside the scope of our company.

  - Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.

Now let's analyze the Google Play market a bit.

### Google Play Analysis

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [50]:
google_data['Installs'].value_counts()

1,000,000+        1394
100,000+          1024
10,000,000+        935
10,000+            904
1,000+             744
100+               613
5,000,000+         605
500,000+           493
50,000+            423
5,000+             400
10+                314
500+               288
50,000,000+        204
100,000,000+       189
50+                170
5+                  70
1+                  45
500,000,000+        24
1,000,000,000+      20
0+                   4
0                    1
Name: Installs, dtype: int64

One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).
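
For reference, the cleaning step on its own can be sketched with pandas string methods. This is a vectorized alternative shown for illustration, not the loop used in this notebook; the `sample` DataFrame and its rows are hypothetical, and `Category` is assumed to be the name of the genre column.

```python
import pandas as pd

# Hypothetical sample: Installs strings in the same format as above
sample = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS"],
    "Installs": ["1,000,000+", "100,000+", "5,000+", "0"],
})

# Strip commas and plus signs as literal characters, then convert to float
sample["installs_clean"] = (
    sample["Installs"]
    .str.replace(",", "", regex=False)
    .str.replace("+", "", regex=False)
    .astype(float)
)

# Average installs per category, treating "100,000+" as exactly 100,000
avg_installs = sample.groupby("Category")["installs_clean"].mean()
print(avg_installs)
```

Passing `regex=False` matters for the plus sign, which would otherwise be read as a regular-expression quantifier in pandas versions where `regex=True` is the default.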