# Analyzing Mobile App Data

#### The objective of this guided project is to demonstrate competence in data analysis and data science tasks using Python. The project relies on the following libraries:

In [10]:
import pandas as pd
import numpy as np

As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play. Collecting data for over 4 million apps requires a significant amount of time and money, so we will analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we first look for relevant existing data available at no cost. Luckily, two datasets suit our goals, one from Google Play and one from the Apple App Store. We will use them to determine the most popular genres of apps and to identify a genre that is popular on both stores, which maximizes our chances of developing a popular app.

In [11]:
google_apps = pd.read_csv('https://dq-content.s3.amazonaws.com/350/googleplaystore.csv')
google_apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [12]:
apple_apps = pd.read_csv('https://dq-content.s3.amazonaws.com/350/AppleStore.csv')
apple_apps.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


In [13]:
google_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [14]:
apple_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7197 entries, 0 to 7196
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                7197 non-null   int64  
 1   track_name        7197 non-null   object 
 2   size_bytes        7197 non-null   int64  
 3   currency          7197 non-null   object 
 4   price             7197 non-null   float64
 5   rating_count_tot  7197 non-null   int64  
 6   rating_count_ver  7197 non-null   int64  
 7   user_rating       7197 non-null   float64
 8   user_rating_ver   7197 non-null   float64
 9   ver               7197 non-null   object 
 10  cont_rating       7197 non-null   object 
 11  prime_genre       7197 non-null   object 
 12  sup_devices.num   7197 non-null   int64  
 13  ipadSc_urls.num   7197 non-null   int64  
 14  lang.num          7197 non-null   int64  
 15  vpp_lic           7197 non-null   int64  
dtypes: float64(3), int64(8), object(5)
memory 

Now that we have imported the datasets, each with a default integer index that uniquely identifies its rows, we can begin cleaning the data. A comment on a discussion post reports an incorrect rating entry for row 10472 in the Google Play dataset. We locate that row by filtering for a null Content Rating, then remove it.

In [96]:
google_apps[google_apps['Content Rating'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,English


Now that we have confirmed that the row matches the description in the discussion post, we will remove it from the dataframe.

In [16]:
google_apps.dropna(subset=['Content Rating'], inplace=True)
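
Another way to locate a faulty rating is to filter on the Rating column itself. The sketch below assumes the bad value lies outside the valid 1 to 5 range. It runs on a small toy table (`toy_apps`, with made-up app names and values) rather than the real dataset.

```python
import pandas as pd

# Toy data standing in for the Google Play table (values are illustrative only).
toy_apps = pd.DataFrame({
    "App": ["Sketch Pad", "Broken Row App", "Weather Now"],
    "Rating": [4.5, 19.0, 3.8],
})

# Google Play ratings run from 1 to 5, so anything outside that range is suspect.
out_of_range = toy_apps[(toy_apps["Rating"] < 1) | (toy_apps["Rating"] > 5)]
print(out_of_range)

# Drop the suspect rows by index label and keep the rest.
toy_clean = toy_apps.drop(index=out_of_range.index)
print(toy_clean)
```

Filtering on the value makes the check independent of which other columns happen to be null in the faulty row.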

Let's inspect the dataframes to ensure we do not have any more null values that may prove problematic when we begin analysis.

In [17]:
print(google_apps.isnull().sum())

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       0
Genres               0
Last Updated         0
Current Ver          8
Android Ver          2
dtype: int64


In [18]:
print(apple_apps.isnull().sum())

id                  0
track_name          0
size_bytes          0
currency            0
price               0
rating_count_tot    0
rating_count_ver    0
user_rating         0
user_rating_ver     0
ver                 0
cont_rating         0
prime_genre         0
sup_devices.num     0
ipadSc_urls.num     0
lang.num            0
vpp_lic             0
dtype: int64


Upon inspection, no rows in the dataframe have a null Content Rating anymore. However, a significant number of ratings in the google_apps dataframe are null, one app has a null type, and a handful of apps have a null current version or Android version. Null values in Rating may affect our analysis, and a null value for Type isn't ideal, but the version of the app is irrelevant to our analysis. Let's remove the null values in Rating, then inspect the app with a null value in "Type".

In [19]:
google_apps.dropna(subset = ['Rating'], inplace = True)

In [20]:
null_rows_type = google_apps[google_apps['Type'].isnull()]
null_rows_type

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver


The query returns no rows, which suggests that the app with a null "Type" was already removed along with the null ratings. To be safe, we will drop null values in the "Type" column as well, since missing values may interfere with our analysis.

In [21]:
google_apps.dropna(subset = ['Type'], inplace = True)

In [22]:
google_apps.isnull().sum()

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       4
Android Ver       2
dtype: int64

Upon inspection, the google_apps dataframe is much cleaner now that the null values have been removed. Only a few null version fields remain, and they are irrelevant to our analysis.

Another crucial aspect of data cleaning is ensuring there is no duplicate data. Since some columns legitimately contain repeated values, we specify that duplicates are identified only by app name for Google Play and by app id for the App Store.

In [23]:
google_apps.drop_duplicates(subset = ['App'], inplace = True)

In [24]:
apple_apps.drop_duplicates(subset = ['id'], inplace = True)
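
By default, drop_duplicates keeps the first occurrence of each name, whichever copy that happens to be. A possible refinement, not used in this analysis, is to keep the copy with the most reviews, on the assumption that it is the most recent snapshot of the app. The sketch below uses a toy table (`toy_apps`) with invented values.

```python
import pandas as pd

toy_apps = pd.DataFrame({
    "App": ["Chat App", "Chat App", "Photo Fun", "Chat App"],
    "Reviews": ["1200", "1500", "300", "900"],  # stored as text, as in the source table
})

# Convert the review counts to numbers so they sort numerically, not alphabetically.
toy_apps["Reviews"] = pd.to_numeric(toy_apps["Reviews"])

# Sort so the most-reviewed copy of each app comes first, then keep that copy.
deduplicated = (
    toy_apps.sort_values("Reviews", ascending=False)
            .drop_duplicates(subset=["App"], keep="first")
            .sort_index()
)
print(deduplicated)
```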

Our company is based in the United States and focuses on apps made for English speakers, so we want to keep only apps whose names consist of characters commonly used in English text. First, I will create a function that labels a string based on the output of the "ord" function. The ord function takes a single character as input and returns its integer Unicode code point. The characters most commonly used in English text have code points from 0 to 127 (the ASCII range). I will also allow a few special characters outside that range: ® (174), ™ (8482), the en dash (8211), the em dash (8212), and the open book emoji (128214).

In [25]:
def english_detector(string):
    english_range = list(range(128))+[174]+[8482]+[8211]+[8212]+[128214]
    for character in string:
        if ord(character) not in english_range:
            return 'Not English'
    return 'English'


Ensure the english detector function works.

In [26]:
english_detector('Is this english?')

'English'

In [27]:
english_detector('漫咖')

'Not English'

In [28]:
english_detector('São')

'Not English'
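
To see why 'São' is flagged, it helps to look at the code points directly. This short sketch only uses the built-in ord and chr functions. The variable name `non_ascii` is introduced here for illustration.

```python
# ord() maps a single character to its integer Unicode code point; chr() goes the other way.
print(ord("A"))      # 65
print(chr(65))       # A
print(ord("ã"))      # 227, above the 0-127 ASCII range
print(ord("™"))      # 8482, one of the extra symbols allowed by english_detector

# Characters in 'São' whose code points fall outside the ASCII range.
non_ascii = [ch for ch in "São" if ord(ch) > 127]
print(non_ascii)     # ['ã']
```

A single accented letter is enough to label a name 'Not English', so the function is strict by design.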

Apply the english detector function to the "App" column of the dataframe.

In [29]:
google_apps['English'] = google_apps['App'].apply(english_detector)  

Inspect the 'Not English' entries in the column to ensure that the results align with our intention of identifying and categorizing non-English applications.

In [30]:
google_apps[google_apps['English']=='Not English']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,English
89,Zona Azul Digital Fácil SP CET - OFFICIAL São ...,AUTO_AND_VEHICLES,4.6,7880,Varies with device,"100,000+",Free,0,Everyone,Auto & Vehicles,"May 10, 2018",4.6.5,Varies with device,Not English
300,Röhrich Werner Soundboard,COMICS,4.7,2249,32M,"500,000+",Free,0,Everyone,Comics,"November 16, 2017",1.08,4.0.3 and up,Not English
309,Truyện Vui Tý Quậy,COMICS,4.5,144,4.7M,"10,000+",Free,0,Everyone,Comics,"July 19, 2018",3.0,4.0.3 and up,Not English
310,Comic Es - Shojo manga / love comics free of c...,COMICS,3.9,2181,10M,"100,000+",Free,0,Teen,Comics,"March 5, 2018",1.2.12,4.0.3 and up,Not English
313,"漫咖 Comics - Manga,Novel and Stories",COMICS,4.1,12088,21M,"1,000,000+",Free,0,Mature 17+,Comics,"July 6, 2018",2.3.1,4.0.3 and up,Not English
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10531,Kernel Manager for Franco Kernel ✨,TOOLS,4.8,12700,10M,"100,000+",Paid,$3.49,Everyone,Tools,"August 3, 2018",3.2.5,5.0 and up,Not English
10556,FK Željezničar Izzy,FAMILY,4.9,119,17M,"1,000+",Free,0,Everyone,Entertainment,"November 18, 2016",1.0,4.4 and up,Not English
10665,SB · FN 1870 Mobile Banking,FINANCE,2.9,139,3.3M,"10,000+",Free,0,Everyone,Finance,"June 19, 2017",3.0.5,4.0 and up,Not English
10719,Sona - Nær við allastaðni,LIFESTYLE,4.2,31,25M,"1,000+",Free,0,Everyone,Lifestyle,"August 2, 2018",1.6.3,4.4 and up,Not English


Now that we have verified our results, let's create a new dataframe that contains only the English entries.

In [31]:
google_apps_english = google_apps[google_apps['English'] =='English']
google_apps_english

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,English
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,English
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,English
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up,English
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up,English
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up,English
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10834,FR Calculator,FAMILY,4.0,7,2.6M,500+,Free,0,Everyone,Education,"June 18, 2017",1.0.0,4.1 and up,English
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up,English
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up,English
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device,English


Now that we've cleaned our data, we can begin the analysis. Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps. To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after 6 months, we build an iOS version and add it to the App Store.

The end goal is to reach as many customers as possible and to expand the market if the app is successful. To predict the success of our apps, we will create frequency tables to find the most common categories and genres, then examine which kinds of apps are most successful.

In [32]:
frequency_table_category = google_apps_english['Category'].value_counts()
frequency_table_category

Category
FAMILY                 1563
GAME                    891
TOOLS                   709
PRODUCTIVITY            296
FINANCE                 294
PERSONALIZATION         292
MEDICAL                 289
LIFESTYLE               289
PHOTOGRAPHY             260
BUSINESS                260
COMMUNICATION           253
SPORTS                  251
HEALTH_AND_FITNESS      242
SOCIAL                  198
NEWS_AND_MAGAZINES      192
TRAVEL_AND_LOCAL        182
SHOPPING                176
BOOKS_AND_REFERENCE     166
VIDEO_PLAYERS           146
DATING                  132
EDUCATION               115
MAPS_AND_NAVIGATION     111
ENTERTAINMENT           100
FOOD_AND_DRINK           91
AUTO_AND_VEHICLES        72
WEATHER                  70
ART_AND_DESIGN           60
LIBRARIES_AND_DEMO       59
HOUSE_AND_HOME           59
PARENTING                49
COMICS                   49
EVENTS                   45
BEAUTY                   42
Name: count, dtype: int64
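
Raw counts can be easier to compare as percentages. As a minimal sketch, value_counts accepts normalize=True to return each category's share of the total. The Series below (`toy_categories`) is toy data, not the real Category column.

```python
import pandas as pd

toy_categories = pd.Series(
    ["FAMILY", "GAME", "FAMILY", "TOOLS", "FAMILY", "GAME", "FAMILY", "TOOLS"],
    name="Category",
)

# normalize=True returns each category's share of the total instead of a raw count.
category_share = (toy_categories.value_counts(normalize=True) * 100).round(1)
print(category_share)
```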

Based on the data, the most common categories for apps are family, game, and tools. Tools has more than twice as many applications as the fourth most common category, productivity (709 versus 296), making the top three the most common by far. Now let's explore the most common genres in the Apple App Store to see whether there are any similarities, but first we must apply the english detector function to that dataframe.

In [33]:
apple_apps['English'] = apple_apps['track_name'].apply(english_detector)
apple_apps

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,English
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1,English
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1,English
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1,English
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1,English
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1,English
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7192,1170406182,Shark Boom - Challenge Friends with your Pet,245415936,USD,0.0,0,0,0.0,0.0,1.0.9,4+,Games,38,5,1,1,English
7193,1069830936,【謎解き】ヤミすぎ彼女からのメッセージ,16808960,USD,0.0,0,0,0.0,0.0,1.2,9+,Book,38,0,1,1,Not English
7194,1070052833,Go!Go!Cat!,91468800,USD,0.0,0,0,0.0,0.0,1.1.2,12+,Games,37,2,2,1,English
7195,1081295232,Suppin Detective: Expose their true visage!,83026944,USD,0.0,0,0,0.0,0.0,1.0.3,12+,Entertainment,40,0,1,1,English


In [34]:
apple_apps_english = apple_apps[apple_apps['English']=='English']
apple_apps_english.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic,English
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1,English
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1,English
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1,English
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1,English
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1,English


In [35]:
frequency_table_prime = apple_apps_english['prime_genre'].value_counts()
frequency_table_prime

prime_genre
Games                3323
Entertainment         439
Education             401
Photo & Video         335
Utilities             201
Productivity          165
Health & Fitness      160
Music                 135
Social Networking     124
Sports                101
Lifestyle              96
Shopping               79
Weather                67
Travel                 55
Book                   52
News                   52
Business               51
Reference              49
Finance                47
Food & Drink           44
Navigation             28
Medical                21
Catalogs                5
Name: count, dtype: int64

The most common genres are games and entertainment. Entertainment-oriented genres dominate the list, followed by genres designed for practical purposes: Education, Photo & Video, Utilities, Productivity, and Health & Fitness. If I were to recommend an app based on these counts alone, I would recommend a game or an entertainment app. However, frequency only shows how many apps exist in each genre, not how well they perform, so the analysis is incomplete. Let's continue by looking at the popularity of the apps in each app store. Fortunately, the Google Play dataset records the number of installs for each app, which is a good indicator of popularity.

In [58]:
popular_google_apps = google_apps_english.groupby(['Category', 'Installs']).size().reset_index(name = 'Frequency')
popular_google_apps.sort_values(by = ['Installs','Frequency'], ascending = [False, False])

Unnamed: 0,Category,Installs,Frequency
89,COMMUNICATION,"500,000,000+",5
405,TOOLS,"500,000,000+",5
191,GAME,"500,000,000+",4
340,PRODUCTIVITY,"500,000,000+",4
282,NEWS_AND_MAGAZINES,"500,000,000+",2
...,...,...,...
63,COMICS,"1,000+",1
283,PARENTING,"1,000+",1
435,WEATHER,"1,000+",1
173,GAME,1+,2


At the top of this sort, apps categorized as communication, tools, games, and productivity appear most often in the highest install tier shown. Note that Installs is stored as text, so the sort is alphabetical rather than numeric, and a tier that begins with a smaller digit (for example, 1,000,000,000+) would not appear first. Now let's look at the popularity of apps on the Apple App Store to try to find a type of app that is widely successful on both platforms. We do not have the number of installs for the Apple App Store, so we must find an alternative way to gauge popularity. I believe ratings are an appropriate measure of popularity.

In [54]:
pop_apple_apps = apple_apps_english.groupby(['prime_genre', 'user_rating_ver']).size().reset_index(name = 'Frequency')
pop_apple_apps.sort_values(by = ['user_rating_ver', 'Frequency'], ascending = [False ,False])

Unnamed: 0,prime_genre,user_rating_ver,Frequency
64,Games,5.0,419
128,Photo & Video,5.0,75
35,Entertainment,5.0,46
25,Education,5.0,44
73,Health & Fitness,5.0,39
...,...,...,...
0,Book,0.0,8
7,Business,0.0,7
101,Navigation,0.0,7
139,Reference,0.0,4


On the Apple App Store, games are overwhelmingly the most common genre among apps with a 5.0 rating, with photo and video a distant second, and entertainment, education, and health and fitness third, fourth, and fifth respectively. The genre that stands out in both app stores so far is games. To analyze further, let's examine the top-rated apps on Google Play as another measure of popularity.
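
A complementary proxy for popularity, not used in this analysis, is the number of ratings an app has received, on the assumption that more ratings means more users. The sketch below averages rating_count_tot per genre on a toy table (`toy_ios`) with invented numbers.

```python
import pandas as pd

toy_ios = pd.DataFrame({
    "prime_genre": ["Games", "Games", "Music", "Music", "Weather"],
    "rating_count_tot": [1000, 3000, 500, 1500, 200],
})

# Average number of ratings per app in each genre, largest first.
avg_ratings_by_genre = (
    toy_ios.groupby("prime_genre")["rating_count_tot"]
           .mean()
           .sort_values(ascending=False)
)
print(avg_ratings_by_genre)
```

An average like this can be skewed by a few very large apps, so it is worth reading alongside the frequency tables rather than in place of them.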

In [79]:
ratings_google_apps = google_apps_english.groupby(['Category','Rating']).size().reset_index(name = 'Frequency')
ratings_google_apps.sort_values(by = ['Rating', 'Frequency'], ascending = [False, False])

Unnamed: 0,Category,Rating,Frequency
244,FAMILY,5.0,65
421,LIFESTYLE,5.0,27
473,MEDICAL,5.0,25
96,BUSINESS,5.0,18
700,TOOLS,5.0,17
...,...,...,...
66,BUSINESS,1.0,1
116,COMMUNICATION,1.0,1
141,DATING,1.0,1
299,GAME,1.0,1


It appears that although family, lifestyle, medical, and business apps are very highly rated, they do not have as many downloads. Interestingly, the Apple App Store ratings come in half-point steps, which effectively groups them into bins. I believe we can make the analysis more comprehensive by creating a column that separates the Google Play ratings into similar bins. First, let's make sure the Rating column has the correct data type.

In [80]:
ratings_google_apps.dtypes['Rating']

dtype('float64')

Now that we have confirmed that Rating is a float, we can establish bins for the data and examine the output.

In [92]:
google_apps_english['ratings_bins'] = pd.cut(google_apps_english['Rating'], 
                                             bins=[0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5], 
                                             labels=['0.0 - 0.5', '0.5 - 1.0', '1.0 - 1.5', '1.5 - 2.0', '2.0 - 2.5', '2.5 - 3.0', '3.0 - 3.5', '3.5 - 4.0', '4.0 - 4.5', '4.5 - 5.0'], 
                                             right=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  google_apps_english['ratings_bins'] = pd.cut(google_apps_english['Rating'],
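
Two details of this cell are worth checking on a small example. With right=False, each bin includes its left edge and excludes its right edge, so a rating of exactly 5.0 falls outside the last bin. The warning appears because the column is assigned on a filtered slice. The sketch below uses toy data, and the 5.01 edge is a hypothetical workaround rather than the method used in this analysis.

```python
import pandas as pd

toy_ratings = pd.DataFrame({"Rating": [3.2, 4.5, 4.9, 5.0]})
edges = [0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5]

# right=False makes every bin closed on the left and open on the right,
# so the last bin is [4.5, 5.0) and a rating of exactly 5.0 becomes NaN.
left_closed = pd.cut(toy_ratings["Rating"], bins=edges, right=False)
print(left_closed.isna().tolist())

# Hypothetical workaround: nudge the final edge just past 5.0 so that 5.0 is kept.
edges_inclusive = edges[:-1] + [5.01]

# Taking an explicit copy of the filtered frame avoids the "copy of a slice" warning.
toy_binned = toy_ratings[toy_ratings["Rating"] > 0].copy()
toy_binned["ratings_bins"] = pd.cut(toy_binned["Rating"], bins=edges_inclusive, right=False)
print(toy_binned)
```

If 5.0 ratings are excluded, the counts in the top bin understate the number of highly rated apps.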


In [95]:
ratings_google_apps = google_apps_english.groupby(['Category', 'ratings_bins']).size().reset_index(name = 'Frequency')
ratings_google_apps.sort_values(by = ['ratings_bins', 'Frequency'], ascending = [False, False])


Unnamed: 0,Category,ratings_bins,Frequency
119,FAMILY,4.5 - 5.0,421
149,GAME,4.5 - 5.0,265
299,TOOLS,4.5 - 5.0,150
159,HEALTH_AND_FITNESS,4.5 - 5.0,111
239,PERSONALIZATION,4.5 - 5.0,106
...,...,...,...
280,SPORTS,0.0 - 0.5,0
290,TOOLS,0.0 - 0.5,0
300,TRAVEL_AND_LOCAL,0.0 - 0.5,0
310,VIDEO_PLAYERS,0.0 - 0.5,0


Upon analyzing the binned data, we find that games are the second most common category among highly rated apps, behind family. Since games rank near the top by both frequency and rating on Google Play and also dominate the Apple App Store, I believe the most prudent action is to create a game for Google Play first. This concludes my analysis of the app stores. Thank you very much.