# Kaggle Google Playstore Dataset Analysis
## The data for this project is from [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps); the analysis is done with SQL

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))
import sqlite3

# Any results you write to the current directory are saved as output.

['googleplaystore_user_reviews.csv', 'googleplaystore.csv']


## Data Exploration

In [2]:
df = pd.read_csv("../input/googleplaystore.csv")
df

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up


In [3]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [4]:
# "Installs" column has to be changed to 'float' instead of 'object'
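def deal_with_abnormal_strings(data):
    # Work on a copy so the assignment below does not write into a view of df
    data = data.copy()
    # Mark entries that are not purely digits with the sentinel value -1
    data[data.str.isnumeric() == False] = -1
    return data.astype(np.float32)

# regex=False: "+" is a regex metacharacter, so it has to be matched literally
df.Installs = df.Installs.str.replace("+", "", regex=False)
df.Installs = df.Installs.str.replace(",", "", regex=False)
df.Installs = deal_with_abnormal_strings(df.Installs)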

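As an alternative to the helper above, the same cleaning can be sketched with `pd.to_numeric`. This is only a sketch on hypothetical sample strings (not rows taken from the dataset), and it assumes that marking bad values as `NaN` is acceptable in place of the `-1` sentinel.

```python
import pandas as pd

# Hypothetical sample values shaped like the raw "Installs" strings.
raw = pd.Series(["10,000+", "5,000,000+", "Free", "0"])

# Strip the literal "+" and "," characters (regex=False treats them as plain text).
cleaned = raw.str.replace("+", "", regex=False).str.replace(",", "", regex=False)

# errors="coerce" turns anything non-numeric into NaN instead of raising.
installs = pd.to_numeric(cleaned, errors="coerce")

print(installs.tolist())
```

With `NaN` the malformed entry stays visibly missing, whereas a `-1` sentinel is a real number that later sums and averages will include.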


## Data Analysis (Using SQL)

In [5]:
# Connect to a database
conn = sqlite3.connect("playstore.db")
# Store the table in the database (replace it if the notebook is re-run)
df.to_sql('googleplaystore', conn, if_exists='replace')



### These are the details of the columns
1) App: Name of the app
2) Category: Category under which the app falls
3) Rating: The app's rating on the Play Store
4) Reviews: Number of reviews of the app
5) Size: Size of the app
6) Installs: Number of installs of the app
7) Type: Whether the app is free or paid
8) Price: Price of the app (0 if it is free)
9) Content Rating: Appropriate target audience of the app
10) Genres: Genre under which the app falls
11) Last Updated: Date when the app was last updated
12) Current Ver: Current version of the app
13) Android Ver: Minimum Android version required to run the app

In [6]:
# Shows 10 most common categories of applications
sql_query = '''
    SELECT Category, COUNT(*) AS TotalApps
    FROM googleplaystore 
    GROUP BY Category 
    ORDER BY TotalApps DESC 
    LIMIT 10
    '''
commonApps = pd.read_sql(sql_query, conn)
# Save as an SQL table; to_sql does not return a DataFrame, so keep commonApps as it is
commonApps.to_sql('commonApps', conn, if_exists='replace')

In [7]:
pd.read_sql('''
    SELECT Category, COUNT(*) AS TotalApps,  ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) from googleplaystore),2) AS AppsPercentage 
    FROM googleplaystore 
    GROUP BY Category 
    ORDER BY TotalApps DESC 
    LIMIT 10
    ''', conn)

Unnamed: 0,Category,TotalApps,AppsPercentage
0,FAMILY,1972,18.19
1,GAME,1144,10.55
2,TOOLS,843,7.78
3,MEDICAL,463,4.27
4,BUSINESS,460,4.24
5,PRODUCTIVITY,424,3.91
6,PERSONALIZATION,392,3.62
7,COMMUNICATION,387,3.57
8,SPORTS,384,3.54
9,LIFESTYLE,382,3.52


**The "Family" category is the most common in the Play Store, followed by "Game", "Tools", and so on.  
18.19% of the listed apps are in "Family", 10.55% in "Game", 7.78% in "Tools", etc.**
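The percentages can be checked by hand. The sketch below assumes the total row count equals the sum of the per-rating app counts reported in the "Content Rating" table later in this notebook; the variable names are only illustrative.

```python
# Per-rating app counts from the "Content Rating" table; together they cover every row.
total_apps = 8714 + 1208 + 414 + 499 + 3 + 2 + 1

# App counts for the three largest categories, copied from the table above.
category_counts = {"FAMILY": 1972, "GAME": 1144, "TOOLS": 843}

# Same formula as the SQL query: COUNT(*) * 100.0 / total, rounded to 2 decimals.
shares = {name: round(count * 100.0 / total_apps, 2) for name, count in category_counts.items()}

print(total_apps)
print(shares)
```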

**The natural follow-up question is whether these common categories are actually popular.  
By "popular" I mean: do these apps have a high number of installs? How about their ratings?  
Let's explore further.**

In [8]:
pd.read_sql(
    '''
    SELECT googleplaystore.Category AS Category, SUM(Installs) AS Total_Installs, AVG(Rating) AS AvgRating 
    FROM googleplaystore 
    INNER JOIN commonApps ON googleplaystore.Category = commonApps.Category 
    GROUP BY googleplaystore.Category 
    ORDER BY Total_Installs DESC
    ''', conn)

Unnamed: 0,Category,Total_Installs,AvgRating
0,GAME,35086020000.0,4.286326
1,COMMUNICATION,32647280000.0,4.158537
2,PRODUCTIVITY,14176090000.0,4.211396
3,TOOLS,11452770000.0,4.047411
4,FAMILY,10258260000.0,4.192272
5,PERSONALIZATION,2325495000.0,4.335987
6,SPORTS,1751174000.0,4.223511
7,BUSINESS,1001915000.0,4.121452
8,LIFESTYLE,537643500.0,4.094904
9,MEDICAL,53257440.0,4.189143


**The table above shows that among the 10 most common app categories, "Game" has the most installs, followed by "Communication" and so on. The average rating is similar across all of these categories (roughly 4.05 to 4.34).  
Interestingly, there is no obvious *correlation* between the number of installs and the number of apps developed in a category: "Family" has the most apps but ranks fifth by installs, while "Communication" ranks eighth by app count but second by installs.**
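The claim about correlation can be roughly quantified with a rank correlation over the ten categories. This is a minimal sketch that reuses the numbers from the two tables above; computing Spearman's coefficient as the Pearson correlation of the ranks is my choice here, not something the queries do.

```python
import pandas as pd

# App counts and install totals for the 10 most common categories (from the tables above).
summary = pd.DataFrame(
    {
        "Category": ["FAMILY", "GAME", "TOOLS", "MEDICAL", "BUSINESS", "PRODUCTIVITY",
                     "PERSONALIZATION", "COMMUNICATION", "SPORTS", "LIFESTYLE"],
        "TotalApps": [1972, 1144, 843, 463, 460, 424, 392, 387, 384, 382],
        "TotalInstalls": [10258260000.0, 35086020000.0, 11452770000.0, 53257440.0,
                          1001915000.0, 14176090000.0, 2325495000.0, 32647280000.0,
                          1751174000.0, 537643500.0],
    }
)

# Spearman's rho is the Pearson correlation between the two rank vectors.
rank_apps = summary["TotalApps"].rank()
rank_installs = summary["TotalInstalls"].rank()
spearman_rho = rank_apps.corr(rank_installs)

print(round(spearman_rho, 2))
```

The coefficient comes out at about 0.31, which is weak for only ten data points and is in line with the observation that app count says little about installs.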

**The table below shows each category's share of total installs across all categories, not only the 10 most common ones.  
20.93% of installs come from "Game", 19.48% from "Communication", etc.**

In [9]:
pd.read_sql(
    '''
    SELECT Category, AVG(Rating) AS AvgRating, ROUND(SUM(Installs) * 100.0 / (SELECT SUM(Installs) FROM googleplaystore), 2) AS InstallsPercentage 
    FROM googleplaystore 
    GROUP BY Category 
    ORDER BY InstallsPercentage DESC
    ''', conn)

Unnamed: 0,Category,AvgRating,InstallsPercentage
0,GAME,4.286326,20.93
1,COMMUNICATION,4.158537,19.48
2,PRODUCTIVITY,4.211396,8.46
3,SOCIAL,4.255598,8.39
4,TOOLS,4.047411,6.83
5,FAMILY,4.192272,6.12
6,PHOTOGRAPHY,4.192114,6.02
7,NEWS_AND_MAGAZINES,4.132189,4.47
8,TRAVEL_AND_LOCAL,4.109292,4.1
9,VIDEO_PLAYERS,4.06375,3.71
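The share-of-total pattern in the query above (a grouped `SUM` divided by a scalar subquery over the whole table) can be tried in isolation. The sketch below uses an in-memory SQLite database and a made-up `apps` table with hypothetical values, so the numbers are easy to check by hand.

```python
import sqlite3

# In-memory database with a small hypothetical table (not the real dataset).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE apps (Category TEXT, Installs REAL)")
conn_demo.executemany(
    "INSERT INTO apps VALUES (?, ?)",
    [("GAME", 600.0), ("GAME", 150.0), ("TOOLS", 200.0), ("SOCIAL", 50.0)],
)

# Each group's sum divided by the grand total from the scalar subquery.
rows = conn_demo.execute(
    """
    SELECT Category,
           ROUND(SUM(Installs) * 100.0 / (SELECT SUM(Installs) FROM apps), 2) AS InstallsPercentage
    FROM apps
    GROUP BY Category
    ORDER BY InstallsPercentage DESC
    """
).fetchall()

print(rows)  # [('GAME', 75.0), ('TOOLS', 20.0), ('SOCIAL', 5.0)]
conn_demo.close()
```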


**Next, I would like to analyze the data in terms of "Content Rating".  
The table below shows (as expected) that apps rated "Everyone" are both the most developed and the most installed.**

In [10]:
pd.read_sql(
    '''
    SELECT COUNT(*) AS TotalApps, SUM(Installs)  AS TotalInstalls , "Content Rating"
    FROM googleplaystore 
    GROUP BY "Content Rating"
    ORDER BY TotalInstalls DESC
    ''', conn)

Unnamed: 0,TotalApps,TotalInstalls,Content Rating
0,8714,114156700000.0,Everyone
1,1208,34716350000.0,Teen
2,414,13233880000.0,Everyone 10+
3,499,5524491000.0,Mature 17+
4,3,2000000.0,Adults only 18+
5,2,50500.0,Unrated
6,1,-1.0,
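The last row of the table above has an empty content rating and a total of -1, which looks like the `-1` sentinel written during cleaning rather than a real install count. One way to keep such rows out of the totals is a `WHERE` filter. The sketch below shows the idea on a hypothetical in-memory table; the rows are invented for illustration.

```python
import sqlite3

# Hypothetical rows: three valid apps and one malformed row carrying the -1 sentinel.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute('CREATE TABLE apps ("Content Rating" TEXT, Installs REAL)')
conn_demo.executemany(
    "INSERT INTO apps VALUES (?, ?)",
    [("Everyone", 1000.0), ("Everyone", 500.0), ("Teen", 200.0), (None, -1.0)],
)

# Filtering on Installs >= 0 drops the sentinel row before grouping.
rows = conn_demo.execute(
    """
    SELECT "Content Rating", COUNT(*) AS TotalApps, SUM(Installs) AS TotalInstalls
    FROM apps
    WHERE Installs >= 0
    GROUP BY "Content Rating"
    ORDER BY TotalInstalls DESC
    """
).fetchall()

print(rows)  # [('Everyone', 2, 1500.0), ('Teen', 1, 200.0)]
conn_demo.close()
```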
