# EDA: Google Playstore Data

# Project Preview

<img src="../assets/picture.jpg" alt="Title-Pic">

## Data Storytelling

This dataset comes from <a href="https://datacamp.com">datacamp.com</a> and contains data on apps in the Google Play Store. <br>
We want to find out which kinds of apps are the most popular, measured by rating and number of downloads.

<br>

## Data questions

### Main-Topics

#### Which kinds of apps get the best ratings?

- genre
- content ratings
- free / paid apps
- price category of paid apps
- number of reviews per app
- app size

#### Which kinds of apps get the most downloads?

- genre
- rating
- content ratings
- free / paid apps
- price category of paid apps
- number of reviews per app
- app size

<br>

### General-Topics

#### Genre

- Which genres are published most often?
- What are the top genres (by rating, by downloads, and by both combined)?
- Which genre has the best rating-download combination (weighted: rating = 30%, downloads = 70%)?
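
The weighted rating-download combination could be computed along the following lines. This is a minimal sketch, not the notebook's actual method: the toy values, the `min_max` helper, and the choice to min-max normalize both measures before weighting are assumptions. Some normalization is needed because ratings and download counts live on very different scales.

```python
import pandas as pd

# Toy data standing in for the cleaned app table (values are made up).
apps = pd.DataFrame({
    "Genres": ["Tools", "Tools", "Puzzle", "Puzzle", "Education"],
    "Rating": [4.0, 4.4, 4.8, 4.6, 4.5],
    "Downloads": [1_000_000, 5_000_000, 100_000, 500_000, 10_000],
})


def min_max(series: pd.Series) -> pd.Series:
    """Scale a series to the range [0, 1] (hypothetical helper)."""
    span = series.max() - series.min()
    return (series - series.min()) / span if span else series * 0.0


# Aggregate per genre: mean rating and total downloads.
per_genre = apps.groupby("Genres").agg(
    mean_rating=("Rating", "mean"),
    total_downloads=("Downloads", "sum"),
)

# Weighted score: rating counts 30%, downloads count 70%.
per_genre["score"] = (
    0.3 * min_max(per_genre["mean_rating"])
    + 0.7 * min_max(per_genre["total_downloads"])
)
ranking = per_genre.sort_values("score", ascending=False)
print(ranking)
```

With these weights, a genre with very high downloads outranks a genre with a better mean rating, which is the intended emphasis of the 30/70 split.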

#### Free & Paid Apps

- How are apps distributed between free and paid?
- Do paid apps get better ratings than free apps?
- Relative to the number of apps released, are paid apps downloaded more often than free apps?
- Does the price affect the rating (3 price categories)? Do high-priced apps get better ratings?
- Does the price affect the downloads (3 price categories)? Do high-priced apps get more downloads?
- Which price category has the best rating-download combination if we want the highest revenue?
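
One way to form the three price categories is `pd.cut`. The sketch below uses made-up prices, and both the bin edges (2 and 10 USD) and the `Price Category` column name are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical paid apps; prices are in USD and made up for illustration.
paid = pd.DataFrame({
    "Price": [0.99, 2.99, 4.99, 9.99, 29.99],
    "Rating": [4.1, 4.3, 4.0, 4.6, 3.8],
})

# Assumed bin edges: low up to 2 USD, medium up to 10 USD, high above that.
bins = [0, 2, 10, float("inf")]
labels = ["low", "medium", "high"]
paid["Price Category"] = pd.cut(paid["Price"], bins=bins, labels=labels)

# Mean rating per price category.
rating_by_price = paid.groupby("Price Category", observed=True)["Rating"].mean()
print(rating_by_price)
```

The same grouping can be reused for downloads by swapping the aggregated column.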

#### App Rating

- How is the total rating distribution over all apps (10 (0.5 - 5) categories)?
- How is the total rating distribution over all apps (5 (1 - 5) categories)?
- Got a app with many reviews an better rating? Is there a significant threshold?
- Is there an relationship between the rating and the size of the app? Do bigger apps got an better rating, because of the higher functionality density?
- How is the rating distribution of the different content ratings?
- Are higher rated apps more downloaded?

#### Other

- Are bigger apps more downloaded then smaller apps?
- Which content rating categories will downloaded the most?

## Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pandas import Series, DataFrame


np.set_printoptions(suppress=True)

sns.set(rc={"figure.figsize": (10, 6), "axes.titlesize": 20, "axes.titleweight": "bold", "axes.labelsize": 15})
sns.set_palette("Set2")

## Data overview

In [3]:
DATA_PATH = "../data/apps.csv"
raw_data_df = pd.read_csv(DATA_PATH, delimiter=",")
raw_data_df

,Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25.0,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9654,10836,Sya9a Maroc - FR,FAMILY,4.5,38,53.0,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
9655,10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6,100+,Free,0,Everyone,Education,"July 6, 2018",1,4.1 and up
9656,10838,Parkinson Exercices FR,MEDICAL,,3,9.5,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1,2.2 and up
9657,10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [4]:
df_cleaned = raw_data_df.copy()
df_cleaned.head()

,Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25.0,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [6]:
df_cleaned.shape[0]

9659

In [5]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      9659 non-null   int64  
 1   App             9659 non-null   object 
 2   Category        9659 non-null   object 
 3   Rating          8196 non-null   float64
 4   Reviews         9659 non-null   int64  
 5   Size            8432 non-null   float64
 6   Installs        9659 non-null   object 
 7   Type            9659 non-null   object 
 8   Price           9659 non-null   object 
 9   Content Rating  9659 non-null   object 
 10  Genres          9659 non-null   object 
 11  Last Updated    9659 non-null   object 
 12  Current Ver     9651 non-null   object 
 13  Android Ver     9657 non-null   object 
dtypes: float64(2), int64(2), object(10)
memory usage: 1.0+ MB


Missing values:
- Rating
- Size
- Current Ver
- Android Ver
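
The missing values can also be counted directly instead of being read off the `info()` output. The following sketch uses a small stand-in frame with made-up values; the `sample` name is hypothetical.

```python
import numpy as np
import pandas as pd

# Small stand-in frame with gaps in two columns (values are made up).
sample = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Size": [19.0, 14.0, np.nan, 8.7],
})

# Count missing values per column and keep only columns that have any.
missing = sample.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

# Share of rows affected, per column.
share = (missing / len(sample)).round(3)
print(pd.DataFrame({"missing": missing, "share": share}))
```

Applied to `df_cleaned`, the non-null counts above imply 1463 missing values for Rating, 1227 for Size, 8 for Current Ver, and 2 for Android Ver.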

In [7]:
df_cleaned.describe()

,Unnamed: 0,Rating,Reviews,Size
count,9659.0,8196.0,9659.0,8432.0
mean,5666.172896,4.173243,216592.6,20.395327
std,3102.362863,0.536625,1831320.0,21.827509
min,0.0,1.0,0.0,0.0
25%,3111.5,4.0,25.0,4.6
50%,5814.0,4.3,967.0,12.0
75%,8327.5,4.5,29401.0,28.0
max,10840.0,5.0,78158310.0,100.0


## Data cleaning & preprocessing

- Drop columns: {Unnamed: 0, Android Ver, Current Ver, Last Updated}
- Rename columns: {Installs: Downloads, Content Rating: Content Group}

<br>

- Category (is fine)
- Rating
- Reviews
- Size
- Installs
- Type
- Price
- Content Rating
- Genres  
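
Installs and Price are stored as `object` columns, so they need to be converted before any numeric analysis. A minimal sketch of that conversion, assuming the string formats visible in the preview; the `$4.99` price format is an assumption, since the preview only shows `0`.

```python
import pandas as pd

# Stand-in rows using the raw string formats shown in the overview.
# The "$4.99" price format is an assumption; the preview only shows "0".
raw = pd.DataFrame({
    "Installs": ["10,000+", "500,000+", "100+"],
    "Price": ["0", "$4.99", "0"],
})

# Remove thousands separators and the trailing "+", then cast to int.
raw["Installs"] = (
    raw["Installs"].str.replace(r"[,+]", "", regex=True).astype("int64")
)

# Strip a leading dollar sign, then convert to float.
raw["Price"] = pd.to_numeric(raw["Price"].str.lstrip("$"))
print(raw.dtypes)
```

Note that a value such as `10,000+` is a lower bound, so the converted number is best read as an install bracket rather than an exact count.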

<br>

- Genres: split on ";" and duplicate the row for each genre
- Drop all apps with fewer than 15 reviews (so every remaining app had a real chance to be rated)
- Ratings: bin into 0.5 steps from 0 to 5
- Ratings: bin into steps of 1 from 0 to 5 (.5 rounded)
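
The four steps above could look roughly like this. It is a sketch on made-up rows; the new column names are hypothetical, and rounding halves up is an assumed reading of ".5 rounded".

```python
import numpy as np
import pandas as pd

# Made-up rows in the shape of the cleaned table.
apps = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Genres": ["Art & Design;Creativity", "Education", "Tools"],
    "Rating": [4.3, 4.75, 3.5],
    "Reviews": [967, 4, 120],
})

# Keep only apps with at least 15 reviews.
apps = apps[apps["Reviews"] >= 15].copy()

# One row per genre: split on ";" and repeat the other columns.
apps["Genres"] = apps["Genres"].str.split(";")
apps = apps.explode("Genres", ignore_index=True)

# Round to the nearest 0.5 and to whole stars, with halves rounded up
# (assumption). Column names are hypothetical.
apps["Rating (0.5 steps)"] = np.floor(apps["Rating"] * 2 + 0.5) / 2
apps["Rating (1 steps)"] = np.floor(apps["Rating"] + 0.5)
print(apps)
```

Duplicating rows per genre means an app with two genres is counted twice in any later aggregate, so totals across genres should not be summed back into a per-app total.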

### Columns and constants

In [8]:
df_cleaned.columns

Index(['Unnamed: 0', 'App', 'Category', 'Rating', 'Reviews', 'Size',
       'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
       'Current Ver', 'Android Ver'],
      dtype='object')

In [11]:
# columns
UNNAMED = "Unnamed: 0"
APP_NAME = "App"
CATEGORY = "Category"
RATING = "Rating"
REVIEWS = "Reviews"
SIZE = "Size"
INSTALLS = "Installs"
DOWNLOADS = "Downloads"
TYPE = "Type"
PRICE = "Price"
CONTENT_RATING = "Content Rating"
CONTENT_GROUP = "Content Group"
GENRE = "Genres"
LAST_UPDATED = "Last Updated"
CURR_VERSION = "Current Ver"
ANDROID_VERSION = "Android Ver"

# added columns

# notebook constants
COUNT = "count"
MEAN = "mean"
SUM = "sum"
MEDIAN = "median"

### Drop columns: {Unnamed: 0, Android Ver, Current Ver, Last Updated}

In [10]:
# Pass a list (not a set) so the column order is explicit and deterministic.
df_cleaned.drop(columns=[UNNAMED, ANDROID_VERSION, CURR_VERSION, LAST_UPDATED], inplace=True)
df_cleaned.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres'],
      dtype='object')

### Rename columns: {Installs: Downloads, Content Rating: Content Group}

In [12]:
renaming_map = {INSTALLS: DOWNLOADS, CONTENT_RATING: CONTENT_GROUP}
df_cleaned.rename(columns=renaming_map, inplace=True)
df_cleaned.columns

Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Downloads', 'Type',
       'Price', 'Content Group', 'Genres'],
      dtype='object')

### Category

In [15]:
df_cleaned[CATEGORY].unique()

array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION'],
      dtype=object)

In [16]:
df_cleaned[CATEGORY].value_counts()

FAMILY                 1832
GAME                    959
TOOLS                   827
BUSINESS                420
MEDICAL                 395
PERSONALIZATION         376
PRODUCTIVITY            374
LIFESTYLE               369
FINANCE                 345
SPORTS                  325
COMMUNICATION           315
HEALTH_AND_FITNESS      288
PHOTOGRAPHY             281
NEWS_AND_MAGAZINES      254
SOCIAL                  239
BOOKS_AND_REFERENCE     222
TRAVEL_AND_LOCAL        219
SHOPPING                202
DATING                  171
VIDEO_PLAYERS           163
MAPS_AND_NAVIGATION     131
EDUCATION               119
FOOD_AND_DRINK          112
ENTERTAINMENT           102
AUTO_AND_VEHICLES        85
LIBRARIES_AND_DEMO       84
WEATHER                  79
HOUSE_AND_HOME           74
EVENTS                   64
ART_AND_DESIGN           64
PARENTING                60
COMICS                   56
BEAUTY                   53
Name: Category, dtype: int64

In [17]:
df_cleaned[CATEGORY].isna().sum()

0

This feature is fine: it has no missing values and needs no cleaning.

### Rating

In [18]:
df_cleaned[RATING]

0       4.1
1       3.9
2       4.7
3       4.5
4       4.3
       ... 
9654    4.5
9655    5.0
9656    NaN
9657    4.5
9658    4.5
Name: Rating, Length: 9659, dtype: float64

In [19]:
df_cleaned[RATING].isna().sum()

1463

### Reviews

## Data visualization & interpretation