# Exploratory Data Analysis of Google Play Store apps

### Content:

 - Introduction 
 - Data description 
 - Research questions formulation
 - Data preparation 

# Introduction

Google Play Store is a digital distribution service operated by Google. It allows third-party companies to publish and offer a variety of applications compatible with the Android operating system. Google originally announced the opening of the Android Market online store on October 22, 2008. It combined music, book, video, and game services. As for applications specifically, developers pay a $25 fee for the right to publish.
 
As a result of a rebranding in March 2012, the Android Market portal became what is now known as the Google Play Store. Under its monetization policy, application developers on the platform receive 70 percent of the profit, while the other 30 percent goes toward billing maintenance. Since then, the following major updates have been released: 2014, payment via PayPal; March 2016, free 10-minute game trials in the browser; April 2016, advertising opportunities.
 
As the above shows, Google Play is a developer-friendly and dynamic platform, which helps explain why it hosts more than 2.9 million free and paid applications. Google Play includes 34 app categories, such as Business, Health, Education, and Arcade. Overall, users from 145 countries can install applications, and developers from 150 countries can distribute them.

Retrieved from: https://ru.wikipedia.org/wiki/Google_Play

# Data description

The Google Play Store provides a large base for many kinds of analysis. It would be useful to analyze the performance of applications grouped by size, rating, genre, etc. to inform future app publications. This analysis is based on the "Google Play Store Apps" Kaggle dataset, originally scraped from the Google Play Store by Lavanya Gupta (https://www.kaggle.com/lava18/google-play-store-apps). The dataset is relatively recent, dating from 2018.

It consists of two .csv files. The main and most informative part, which is sufficient for this analysis, is in googleplaystore.csv; the second file contains too many missing values, so it is not used. The following parameters from googleplaystore.csv will be used:

- Application name - as given in Google Play Store
- Category - the category the app belongs to
- Rating - the app's user rating at the time of scraping
- Reviews - the number of reviews at the time of scraping
- Size - the size of the application
- Installs - the number of user downloads
- Type - whether the app is paid or free
- Price - the amount paid for installation, if any
- Content Rating - the targeted age group
- Genres - an additional subcategory, if any
- Last Updated - the date of the last available update



# Research questions formulation

This analysis aims to answer 5 main questions:

1. Analyze whether the number of existing reviews influences the decision to install.
2. Analyze the relationship between the size of the app and the rating.
3. Analyze the distribution of paid applications among different categories.
4. Analyze the relationship between content rating and the number of installs.
5. Analyze the relationship between content rating and the fee requirement.

# Data preparation

Data preparation is carried out in several steps using pandas and numpy. To identify exactly which actions are needed, the data is first read from the .csv file and inspected with the following code:

In [1]:
#importing pandas module
import pandas as pd
#retrieving data from .csv file
apps = pd.read_csv('C:\\Users\\Kamilla\\Downloads\\googleplaystore.csv')
#showing 5 first rows
apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [2]:
print('Dataset parameters are:', apps.shape)

Dataset parameters are: (10841, 13)


This means that the dataset contains 10841 observations and 13 columns.

In [18]:
apps.describe()

Unnamed: 0,Rating,Size,Installs
count,7424.0,7424.0,7424.0
mean,4.171309,20848480.0,7823918.0
std,0.549729,24906090.0,46304110.0
min,1.0,1.0,1.0
25%,4.0,5.9,10000.0
50%,4.3,14000000.0,100000.0
75%,4.5,33000000.0,1000000.0
max,5.0,100000000.0,1000000000.0


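The summary statistics show that the number of observations is the same (7424) for all three numeric columns; exponents such as e+03 or e+06 in Installs and Size simply reflect the kilo/mega magnitudes. The mean rating is quite high: 4.17. Because this cell was run after the cleaning steps described below (note its In [18] number), Size and Installs already appear as numeric columns.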
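
If the summary is displayed in scientific notation, a formatter can render plain numbers instead. The following is a minimal sketch on a toy Installs column (illustrative values, not the real dataset); the variable names are assumptions made for this example.

```python
import pandas as pd

# Toy column standing in for the real Installs data (illustrative values only)
toy = pd.DataFrame({"Installs": [10_000, 1_000_000, 1_000_000_000]})

# Render the summary with thousands separators and no decimals
summary_text = toy.describe().to_string(float_format="{:,.0f}".format)
print(summary_text)
```

Formatting only changes how the numbers are displayed; the underlying values stay the same.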

At this point, the further data preparation steps are clear:

- Remove unnecessary columns
- Remove duplicate rows
- Identify missing values and replace/delete them 
- Convert improperly formatted values in some columns so that they are easier to work with

#### 1. Table restructuring

In the context of this analysis, the "Current Ver" and "Android Ver" columns are not required. The code below therefore deletes these columns.

In [3]:
#deleting column with "Current Ver" name from apps
del apps["Current Ver"]
apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",4.4 and up


In [4]:
#deleting column with "Android Ver" name from apps
del apps["Android Ver"]
apps.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018"
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018"
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018"
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018"
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018"


As a result, the dataset has 11 columns.
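
The two deletions can also be written as a single call. The sketch below uses a toy frame with made-up rows; only the two column names come from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "App": ["A", "B"],
    "Rating": [4.1, 3.9],
    "Current Ver": ["1.0.0", "2.0.0"],
    "Android Ver": ["4.0.3 and up", "4.2 and up"],
})

# Drop both version columns at once; drop() returns a new DataFrame
trimmed = toy.drop(columns=["Current Ver", "Android Ver"])
print(trimmed.columns.tolist())
```

Unlike `del`, `drop` leaves the original frame untouched unless the result is assigned back to the same name.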

The code below removes duplicate rows, leaving 10358 observations.

In [5]:
#assigning new table without any duplicates
apps = apps.drop_duplicates()
apps

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018"
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018"
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018"
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018"
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018"
...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017"
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018"
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017"
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015"
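
To see how many rows a deduplication step removes, the duplicates can be counted first. This is a small sketch on toy data (the rows are invented for illustration):

```python
import pandas as pd

# Toy frame with one fully duplicated row (illustrative values only)
toy = pd.DataFrame({
    "App": ["A", "B", "B", "C"],
    "Reviews": [10, 20, 20, 30],
})

# duplicated() marks rows that repeat an earlier row in every column
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence and preserves index labels
deduplicated = toy.drop_duplicates()
print(n_duplicates, len(toy), len(deduplicated))
```

Because the index labels are preserved, gaps appear in the index after deduplication, which matches the non-consecutive labels in the output above.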


#### 2. Missing values 

The following code shows the total number of missing values by each parameter. 

In [6]:
apps.isnull().sum()

App                  0
Category             0
Rating            1465
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
dtype: int64

As can be seen, Rating has the most missing values. Because it is central to the analysis, the column cannot be dropped; at the same time, the missing values cannot be replaced or interpolated, since the Rating, Type, and Content Rating columns have no trend from which values could be predicted. As the total number of observations is large, the rows with missing values are dropped instead.

In [7]:
#dropping rows with NaN values in the following subset of columns
apps.dropna(subset = ["Rating", "Type", "Content Rating"], inplace=True)
apps

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018"
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018"
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018"
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018"
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018"
...,...,...,...,...,...,...,...,...,...,...,...
10834,FR Calculator,FAMILY,4.0,7,2.6M,500+,Free,0,Everyone,Education,"June 18, 2017"
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017"
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018"
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015"
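
The effect of dropping rows by a subset of columns can be checked on a small example. The frame below is a toy stand-in with invented values; only the column names match the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in two of the three checked columns
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, 4.5],
    "Type": ["Free", "Free", np.nan, "Paid"],
    "Content Rating": ["Everyone", "Teen", "Everyone", "Everyone"],
})

# Count missing values per column before dropping anything
missing_per_column = toy.isnull().sum()

# Remove every row that has a NaN in any of the listed columns
cleaned = toy.dropna(subset=["Rating", "Type", "Content Rating"])
rows_removed = len(toy) - len(cleaned)
print(missing_per_column, rows_removed, sep="\n")
```

Comparing the row counts before and after is a quick way to confirm that only the expected rows were removed.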


Another problem is the "Varies with device" value in some columns. The code below shows 1468 rows with this value. This could distort further analysis, so these values are replaced with NaN and the affected rows are dropped.

In [8]:
apps.loc[apps["Size"] == 'Varies with device']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated
37,Floor Plan Creator,ART_AND_DESIGN,4.1,36639,Varies with device,"5,000,000+",Free,0,Everyone,Art & Design,"July 14, 2018"
42,Textgram - write on photos,ART_AND_DESIGN,4.4,295221,Varies with device,"10,000,000+",Free,0,Everyone,Art & Design,"July 30, 2018"
52,Used Cars and Trucks for Sale,AUTO_AND_VEHICLES,4.6,17057,Varies with device,"1,000,000+",Free,0,Everyone,Auto & Vehicles,"July 30, 2018"
67,Ulysse Speedometer,AUTO_AND_VEHICLES,4.3,40211,Varies with device,"5,000,000+",Free,0,Everyone,Auto & Vehicles,"July 30, 2018"
68,REPUVE,AUTO_AND_VEHICLES,3.9,356,Varies with device,"100,000+",Free,0,Everyone,Auto & Vehicles,"May 25, 2018"
...,...,...,...,...,...,...,...,...,...,...,...
10713,My Earthquake Alerts - US & Worldwide Earthquakes,WEATHER,4.4,3471,Varies with device,"100,000+",Free,0,Everyone,Weather,"July 24, 2018"
10725,Posta App,MAPS_AND_NAVIGATION,3.6,8,Varies with device,"1,000+",Free,0,Everyone,Maps & Navigation,"September 27, 2017"
10765,Chat For Strangers - Video Chat,SOCIAL,3.4,622,Varies with device,"100,000+",Free,0,Mature 17+,Social,"May 23, 2018"
10826,Frim: get new friends on local chat rooms,SOCIAL,4.0,88486,Varies with device,"5,000,000+",Free,0,Mature 17+,Social,"March 23, 2018"
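
To check which columns contain the placeholder and how often, the whole frame can be compared against it. This is a sketch on toy data (invented rows, column names taken from the dataset):

```python
import pandas as pd

# Toy frame in which only Size uses the placeholder (illustrative values)
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Size": ["19M", "Varies with device", "Varies with device"],
    "Installs": ["10,000+", "1,000+", "500+"],
})

placeholder = "Varies with device"

# Element-wise comparison gives a boolean frame; summing counts matches per column
counts = (toy == placeholder).sum()
print(counts)
```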


In [12]:
#importing numpy module
import numpy as np
#replacing "Varies with device" values with NaN
apps = apps.replace(to_replace = "Varies with device", value =np.nan)  
#dropping NaN values
apps.dropna(subset = ["Size"], inplace=True)
#showing the result
apps

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000000.0,10000,Free,0,Everyone,Art & Design,"January 7, 2018"
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000000.0,500000,Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018"
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,5000000,Free,0,Everyone,Art & Design,"August 1, 2018"
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000000.0,50000000,Free,0,Teen,Art & Design,"June 8, 2018"
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,100000,Free,0,Everyone,Art & Design;Creativity,"June 20, 2018"
...,...,...,...,...,...,...,...,...,...,...,...
10833,Chemin (fr),BOOKS_AND_REFERENCE,4.8,44,619000.0,1000,Free,0,Everyone,Books & Reference,"March 23, 2014"
10834,FR Calculator,FAMILY,4.0,7,2.6,500,Free,0,Everyone,Education,"June 18, 2017"
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53000000.0,5000,Free,0,Everyone,Education,"July 25, 2017"
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6,100,Free,0,Everyone,Education,"July 6, 2018"


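The previous code replaced all "Varies with device" values with NaN and then dropped every row with a NaN in Size. Every remaining record is now complete. (The output above comes from a later run of the cell, which is why Size and Installs already appear in numeric form.)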
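
The replacement can also be limited to the Size column, so that the same text in other columns is left alone. A minimal sketch on toy data (invented rows):

```python
import numpy as np
import pandas as pd

# Toy frame with one placeholder value in Size (illustrative values only)
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Size": ["19M", "Varies with device", "201k"],
})

# Replace the placeholder with NaN in the Size column only
toy["Size"] = toy["Size"].replace("Varies with device", np.nan)

# Drop the rows whose Size is now missing
sized = toy.dropna(subset=["Size"])
print(sized)
```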

#### 3. Value conversion

Further analysis requires the Size and Installs columns to be numeric; however, they contain unit suffixes and a '+' sign, respectively. The next code cells convert these columns to the proper type.

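To make the Size values numeric, the kilo and mega letters are first replaced with "000" and "000000" respectively, and the whole value is then converted to a number. Note that this textual substitution is only correct for whole-number sizes: a value such as 8.7M becomes the string 8.7000000 and is parsed as 8.7, as can be seen in the output below.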

In [10]:
apps.Size = apps.Size.str.replace('k','000')
apps.Size = apps.Size.str.replace('M','000000')
apps.Size = pd.to_numeric(apps.Size)
apps

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000000.0,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018"
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000000.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018"
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018"
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000000.0,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018"
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018"
...,...,...,...,...,...,...,...,...,...,...,...
10833,Chemin (fr),BOOKS_AND_REFERENCE,4.8,44,619000.0,"1,000+",Free,0,Everyone,Books & Reference,"March 23, 2014"
10834,FR Calculator,FAMILY,4.0,7,2.6,500+,Free,0,Everyone,Education,"June 18, 2017"
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53000000.0,"5,000+",Free,0,Everyone,Education,"July 25, 2017"
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6,100+,Free,0,Everyone,Education,"July 6, 2018"
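
One way to handle decimal sizes such as 8.7M is to multiply by the unit instead of appending zeros. The sketch below is not the notebook's code: `parse_size` is a hypothetical helper, and it assumes decimal units (k = 1,000 and M = 1,000,000), matching the substitution used above.

```python
import pandas as pd

def parse_size(size: str) -> float:
    """Convert strings such as '19M', '8.7M' or '201k' to a plain number."""
    # Assumed decimal multipliers, mirroring the "000"/"000000" substitution
    multipliers = {"k": 1_000, "M": 1_000_000}
    unit = size[-1]
    if unit in multipliers:
        return float(size[:-1]) * multipliers[unit]
    # No recognised unit suffix: treat the value as already numeric
    return float(size)

# Toy values in the same format as the Size column
sizes = pd.Series(["19M", "8.7M", "201k", "2.8M"])
converted = sizes.map(parse_size)
print(converted)
```

With this approach 8.7M is converted to roughly 8.7 million rather than 8.7, so decimal and whole-number sizes end up on the same scale.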


The same logic is used to remove the plus sign and commas from the Installs column and convert it to a numeric type.

In [11]:
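#regex=False treats '+' and ',' as literal characters in every pandas version
apps.Installs = apps.Installs.str.replace('+', '', regex=False)
apps.Installs = apps.Installs.str.replace(',', '', regex=False)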
apps.Installs = pd.to_numeric(apps.Installs)
apps


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000000.0,10000,Free,0,Everyone,Art & Design,"January 7, 2018"
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000000.0,500000,Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018"
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,5000000,Free,0,Everyone,Art & Design,"August 1, 2018"
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000000.0,50000000,Free,0,Teen,Art & Design,"June 8, 2018"
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,100000,Free,0,Everyone,Art & Design;Creativity,"June 20, 2018"
...,...,...,...,...,...,...,...,...,...,...,...
10833,Chemin (fr),BOOKS_AND_REFERENCE,4.8,44,619000.0,1000,Free,0,Everyone,Books & Reference,"March 23, 2014"
10834,FR Calculator,FAMILY,4.0,7,2.6,500,Free,0,Everyone,Education,"June 18, 2017"
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53000000.0,5000,Free,0,Everyone,Education,"July 25, 2017"
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6,100,Free,0,Everyone,Education,"July 6, 2018"
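
Both characters can also be removed in one pass with a character class. A minimal sketch on toy values in the same format as the Installs column:

```python
import pandas as pd

# Toy values in the same format as the Installs column
installs = pd.Series(["10,000+", "500,000+", "100+"])

# Inside a character class, '+' is a literal, so [+,] matches either symbol
cleaned = installs.str.replace("[+,]", "", regex=True)

# Convert the cleaned strings to integers
numeric = pd.to_numeric(cleaned)
print(numeric.tolist())
```

Passing `regex=True` explicitly avoids relying on the default, which differs between pandas versions.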


At this stage the dataset is clean and well-shaped.