# EDA: Google Playstore Data

# Project Preview

<img src="../assets/picture.jpg" alt="Title-Pic">

## Data StoryTelling

This dataset comes from <a href="https://datacamp.com">datacamp.com</a> and contains data about apps in the Google Play Store. <br>
We want to find out which kinds of apps are the most popular, measured by rating and number of downloads.

<br>

## Data questions

### Main-Topics

#### Which kinds of apps get the best ratings?

- genre
- content ratings
- free / paid apps
- price category of paid apps
- review count of the app
- app size

#### Which kinds of apps get the most downloads?

- genre
- rating
- content ratings
- free / paid apps
- price category of paid apps
- review count of the app
- app size

<br>

### General-Topics

#### Genre

- What are the most published genres?
- What are the top genres (by rating, by downloads, and by both combined)?
- Which genre has the best rating-download combination (weighted: rating = 30%, downloads = 70%)?
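
The weighted combination in the last question mixes two very different scales (ratings from 1 to 5, downloads up to the billions). One possible way to make them comparable is sketched below. It assumes min-max scaling of both features before weighting, which is an assumption and not necessarily the method used later in this notebook. The sample values and the names `genre_stats`, `min_max` and `score` are made up for illustration.

```python
import pandas as pd

# Made-up per-genre aggregates, for illustration only
genre_stats = pd.DataFrame(
    {"rating": [4.4, 4.2, 4.3], "downloads": [1_000_000, 5_000_000, 100_000]},
    index=["Art & Design", "Tools", "Medical"],
)


def min_max(series):
    """Scale a numeric series linearly to the range [0, 1]."""
    return (series - series.min()) / (series.max() - series.min())


# Weighted score: rating = 30%, downloads = 70% (weights taken from the question above)
score = 0.3 * min_max(genre_stats["rating"]) + 0.7 * min_max(genre_stats["downloads"])
best_genre = score.idxmax()
```

With these sample numbers the genre with by far the most downloads wins even though it has the lowest rating, which is the intended effect of the 70% download weight.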

#### Free & Paid Apps

- How are paid and free apps distributed?
- Do paid apps get better ratings than free apps?
- Relative to the number of released apps, are paid apps downloaded more often than free apps?
- Does the price affect the rating (3 price categories)? Do high-priced apps get better ratings?
- Does the price affect the downloads (3 price categories)? Do high-priced apps get more downloads?
- Which price category has the best rating-download combination if we want the highest turnover?

#### App Rating

- How are the ratings distributed over all apps (10 categories, 0.5 - 5)?
- How are the ratings distributed over all apps (5 categories, 1 - 5)?
- Do apps with many reviews get better ratings? Is there a significant threshold?
- Is there a relationship between the rating and the size of an app? Do bigger apps get better ratings because of their higher functionality density?
- How are the ratings distributed across the different content ratings?
- Are higher-rated apps downloaded more often?

#### Other

- Are bigger apps downloaded more often than smaller apps?
- Which content rating categories are downloaded the most?

## Imports

In [1583]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from pandas import Series, DataFrame


np.set_printoptions(suppress=True)

sns.set(rc={"figure.figsize": (10, 6), "axes.titlesize": 20, "axes.titleweight": "bold", "axes.labelsize": 15})
sns.set_palette("Set2")

## Data overview

In [1584]:
DATA_PATH = "../data/apps.csv"
raw_data_df = pd.read_csv(DATA_PATH, delimiter=",")
raw_data_df

,Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25.0,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9654,10836,Sya9a Maroc - FR,FAMILY,4.5,38,53.0,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
9655,10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6,100+,Free,0,Everyone,Education,"July 6, 2018",1,4.1 and up
9656,10838,Parkinson Exercices FR,MEDICAL,,3,9.5,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1,2.2 and up
9657,10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


In [1585]:
df_cleaned = raw_data_df.copy()
df_cleaned.head()

,Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25.0,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [1586]:
df_cleaned.shape[0]

9659

In [1587]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9659 entries, 0 to 9658
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      9659 non-null   int64  
 1   App             9659 non-null   object 
 2   Category        9659 non-null   object 
 3   Rating          8196 non-null   float64
 4   Reviews         9659 non-null   int64  
 5   Size            8432 non-null   float64
 6   Installs        9659 non-null   object 
 7   Type            9659 non-null   object 
 8   Price           9659 non-null   object 
 9   Content Rating  9659 non-null   object 
 10  Genres          9659 non-null   object 
 11  Last Updated    9659 non-null   object 
 12  Current Ver     9651 non-null   object 
 13  Android Ver     9657 non-null   object 
dtypes: float64(2), int64(2), object(10)
memory usage: 1.0+ MB


Missing values:
- Rating
- Size
- Current Ver
- Android Ver

In [1588]:
df_cleaned.describe()

,Unnamed: 0,Rating,Reviews,Size
count,9659.0,8196.0,9659.0,8432.0
mean,5666.172896,4.173243,216592.6,20.395327
std,3102.362863,0.536625,1831320.0,21.827509
min,0.0,1.0,0.0,0.0
25%,3111.5,4.0,25.0,4.6
50%,5814.0,4.3,967.0,12.0
75%,8327.5,4.5,29401.0,28.0
max,10840.0,5.0,78158310.0,100.0


## Data cleaning & preprocessing

- Drop columns: {Unnamed: 0, Category, Android Ver, Current Ver, Last Updated}
- Rename columns: {Installs: Downloads, Content Rating: Content Group}

<br>

- Category (is fine)
- Rating
- Reviews
- Size
- Installs
- Type
- Price
- Content Rating
- Genres  

<br>

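- Genres: split on ";" and duplicate the row for each additional genre
- Drop ratings of apps with fewer than 15 reviews (so that every remaining app had a fair chance of being rated)
- Ratings: classify from 0 to 5 in steps of 0.5
- Ratings: classify from 0 to 5 in steps of 1 (rounded from the 0.5 steps)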

### Columns and constants

In [1589]:
df_cleaned.columns

Index(['Unnamed: 0', 'App', 'Category', 'Rating', 'Reviews', 'Size',
       'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
       'Current Ver', 'Android Ver'],
      dtype='object')

In [1590]:
# columns
UNNAMED = "Unnamed: 0"
APP_NAME = "App"
CATEGORY = "Category"
RATING = "Rating"
REVIEWS = "Reviews"
SIZE = "Size"
INSTALLS = "Installs"
TYPE = "Type"
PRICE = "Price"
CONTENT_RATING = "Content Rating"
GENRE = "Genres"
LAST_UPDATED = "Last Updated"
CURR_VERSION = "Current Ver"
ANDROID_VERSION = "Android Ver"

# added columns
DOWNLOADS = "Downloads"
DOWNLOAD_RATE = "Download Rate"
CONTENT_GROUP = "Content Group"
RATING_CLASS = "Rating Class"
REVIEW_RATE = "Review Rate"
SIZE_CLASS = "Size Class"
PRICE_CLASS = "Price Class"

# notebook constants
COUNT = "count"
MEAN = "mean"
SUM = "sum"
MEDIAN = "median"

### Drop columns: {Unnamed: 0, Category, Android Ver, Current Ver, Last Updated}

In [1591]:
df_cleaned.drop(columns=[UNNAMED, CATEGORY, ANDROID_VERSION, CURR_VERSION, LAST_UPDATED], inplace=True)
df_cleaned.columns

Index(['App', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price',
       'Content Rating', 'Genres'],
      dtype='object')

### Rename columns: {Installs: Downloads, Content Rating: Content Group}

In [1592]:
renaming_map = {INSTALLS: DOWNLOADS, CONTENT_RATING: CONTENT_GROUP}
df_cleaned.rename(columns=renaming_map, inplace=True)
df_cleaned.columns

Index(['App', 'Rating', 'Reviews', 'Size', 'Downloads', 'Type', 'Price',
       'Content Group', 'Genres'],
      dtype='object')

### Rating (1/2)

In [1593]:
df_cleaned[RATING]

0       4.1
1       3.9
2       4.7
3       4.5
4       4.3
       ... 
9654    4.5
9655    5.0
9656    NaN
9657    4.5
9658    4.5
Name: Rating, Length: 9659, dtype: float64

In [1594]:
na_values = df_cleaned[RATING].isna().sum()
total_values = df_cleaned.shape[0]
na_values, total_values, na_values / total_values

(1463, 9659, 0.15146495496428203)

About 15% of the records have no rating. <br>
We impute the missing values with the median rating of the app's genre.

In [1595]:
na_values_genre = df_cleaned.loc[df_cleaned[RATING].isna(), GENRE].unique()
na_values_genre

array(['Art & Design;Action & Adventure', 'Beauty', 'Books & Reference',
       'Business', 'Comics', 'Dating', 'Education', 'Events',
       'Food & Drink', 'House & Home', 'Libraries & Demo', 'Medical',
       'Tools', 'Parenting;Education', 'Parenting',
       'Video Players & Editors', 'Personalization', 'Racing',
       'Photography', 'Social', 'Arcade', 'Sports', 'Communication',
       'Music', 'Trivia', 'Productivity', 'Educational;Education',
       'Board', 'Entertainment', 'Auto & Vehicles', 'Finance',
       'Lifestyle', 'Travel & Local', 'Shopping', 'Health & Fitness',
       'Weather', 'Simulation', 'Casual', 'News & Magazines',
       'Maps & Navigation', 'Action', 'Card', 'Strategy', 'Educational',
       'Puzzle', 'Casino', 'Education;Brain Games', 'Trivia;Education',
       'Word', 'Art & Design', 'Books & Reference;Creativity',
       'Role Playing', 'Adventure', 'Arcade;Action & Adventure',
       'Educational;Pretend Play', 'Role Playing;Education'], dtype=object)

In [1596]:
na_values_genre.shape[0], df_cleaned[GENRE].unique().shape[0]

(56, 118)

Before we can impute the rating by genre, we need to clean the genre column.

### Genre

In [1597]:
df_cleaned[GENRE].value_counts()

Tools                              826
Entertainment                      561
Education                          510
Business                           420
Medical                            395
                                  ... 
Art & Design;Pretend Play            1
Lifestyle;Pretend Play               1
Comics;Creativity                    1
Art & Design;Action & Adventure      1
Strategy;Creativity                  1
Name: Genres, Length: 118, dtype: int64

In [1598]:
df_cleaned[GENRE].unique()

array(['Art & Design', 'Art & Design;Pretend Play',
       'Art & Design;Creativity', 'Art & Design;Action & Adventure',
       'Auto & Vehicles', 'Beauty', 'Books & Reference', 'Business',
       'Comics', 'Comics;Creativity', 'Communication', 'Dating',
       'Education;Education', 'Education', 'Education;Creativity',
       'Education;Music & Video', 'Education;Action & Adventure',
       'Education;Pretend Play', 'Education;Brain Games', 'Entertainment',
       'Entertainment;Music & Video', 'Entertainment;Brain Games',
       'Entertainment;Creativity', 'Events', 'Finance', 'Food & Drink',
       'Health & Fitness', 'House & Home', 'Libraries & Demo',
       'Lifestyle', 'Lifestyle;Pretend Play',
       'Adventure;Action & Adventure', 'Arcade', 'Casual', 'Card',
       'Casual;Pretend Play', 'Action', 'Strategy', 'Puzzle', 'Sports',
       'Music', 'Word', 'Racing', 'Casual;Creativity',
       'Casual;Action & Adventure', 'Simulation', 'Adventure', 'Board',
       'Trivia', 'Role 

In [1599]:
df_cleaned[GENRE].unique().shape[0]

118

Some records contain multiple genres, separated by a semicolon. <br>
We split them and duplicate the record once for each additional genre.

In [1600]:
curr_row_count = df_cleaned.shape[0]
curr_row_count

9659

In [1601]:
is_multiple_genre = df_cleaned[GENRE].str.contains(";")
df_multiple_genre = df_cleaned[is_multiple_genre]
df_multiple_genre.shape[0]

393

In [1602]:
df_cleaned

,App,Rating,Reviews,Size,Downloads,Type,Price,Content Group,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,4.1,159,19.0,"10,000+",Free,0,Everyone,Art & Design
1,Coloring book moana,3.9,967,14.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",4.7,87510,8.7,"5,000,000+",Free,0,Everyone,Art & Design
3,Sketch - Draw & Paint,4.5,215644,25.0,"50,000,000+",Free,0,Teen,Art & Design
4,Pixel Draw - Number Art Coloring Book,4.3,967,2.8,"100,000+",Free,0,Everyone,Art & Design;Creativity
...,...,...,...,...,...,...,...,...,...
9654,Sya9a Maroc - FR,4.5,38,53.0,"5,000+",Free,0,Everyone,Education
9655,Fr. Mike Schmitz Audio Teachings,5.0,4,3.6,100+,Free,0,Everyone,Education
9656,Parkinson Exercices FR,,3,9.5,"1,000+",Free,0,Everyone,Medical
9657,The SCP Foundation DB fr nn5n,4.5,114,,"1,000+",Free,0,Mature 17+,Books & Reference


In [1603]:
added_row_count = 0
for row_index, row in df_multiple_genre.iterrows():
    genres = row[GENRE].split(";")
    for i in range(1, len(genres)):
        added_row_count = added_row_count + 1
        row[GENRE] = genres[i]
        df_cleaned = pd.concat([df_cleaned, DataFrame(row).T], ignore_index=True)
    
    df_cleaned.loc[row_index, GENRE] = genres[0]
df_cleaned.shape[0], added_row_count, df_cleaned.shape[0] - added_row_count, curr_row_count

(10052, 393, 9659, 9659)
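
The row-by-row loop above calls `pd.concat` once per added row. A more compact alternative, sketched here on made-up sample data, is to split the genre string into a list and let `DataFrame.explode` create one row per genre. This is a sketch and not the notebook's method: the duplicated rows end up next to their original row instead of at the end of the frame, so the index differs from the output shown here. The frame `sample` and its values are hypothetical.

```python
import pandas as pd

# Made-up sample with one multi-genre record
sample = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, 3.9, 4.7],
    "Genres": ["Art & Design", "Art & Design;Pretend Play", "Tools"],
})

# Split "a;b" into ["a", "b"], then emit one row per list element
exploded = (
    sample.assign(Genres=sample["Genres"].str.split(";"))
          .explode("Genres")
          .reset_index(drop=True)
)
```

In this sketch the numeric columns keep their dtype (`Rating` stays `float64`), which may matter for later numeric summaries.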

In [1604]:
df_cleaned

,App,Rating,Reviews,Size,Downloads,Type,Price,Content Group,Genres
0,Photo Editor & Candy Camera & Grid & ScrapBook,4.1,159,19.0,"10,000+",Free,0,Everyone,Art & Design
1,Coloring book moana,3.9,967,14.0,"500,000+",Free,0,Everyone,Art & Design
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",4.7,87510,8.7,"5,000,000+",Free,0,Everyone,Art & Design
3,Sketch - Draw & Paint,4.5,215644,25.0,"50,000,000+",Free,0,Teen,Art & Design
4,Pixel Draw - Number Art Coloring Book,4.3,967,2.8,"100,000+",Free,0,Everyone,Art & Design
...,...,...,...,...,...,...,...,...,...
10047,PBS KIDS Games,4.3,12919,94.0,"1,000,000+",Free,0,Everyone,Education
10048,Dolphin and fish coloring book,3.9,2249,,"500,000+",Free,0,Everyone,Creativity
10049,Cake Shop - Kids Cooking,4.3,30668,33.0,"5,000,000+",Free,0,Everyone,Pretend Play
10050,Hair saloon - Spa salon,4.2,38473,23.0,"10,000,000+",Free,0,Everyone,Pretend Play


In [1605]:
df_cleaned[GENRE].str.contains(";").sum()

0

In [1606]:
df_cleaned[GENRE].unique().shape[0]

53

The number of unique genre values dropped from 118 to 53.

### Rating (2/2)

In [1607]:
df_cleaned.describe()

,App,Rating,Reviews,Size,Downloads,Type,Price,Content Group,Genres
count,10052,8575.0,10052,8781.0,10052,10052,10052,10052,10052
unique,9659,39.0,5330,191.0,21,2,92,6,53
top,Kids Learn Languages by Mondly,4.3,0,12.0,"1,000,000+",Free,0,Everyone,Tools
freq,2,952.0,596,185.0,1519,9226,9226,8272,827


#### Missing values

In [1608]:
df_cleaned[RATING].isna().sum()

1477

Duplicating the multi-genre rows increased the number of missing ratings by only 14 (from 1463 to 1477).

##### Impute missing values with the genre median

In [1609]:
df_genre_median_map = df_cleaned.pivot_table(index=GENRE, values=RATING, aggfunc=MEDIAN).to_dict()[RATING]
df_genre_median_map

{'Action': 4.3,
 'Action & Adventure': 4.3,
 'Adventure': 4.3,
 'Arcade': 4.3,
 'Art & Design': 4.4,
 'Auto & Vehicles': 4.3,
 'Beauty': 4.3,
 'Board': 4.3,
 'Books & Reference': 4.5,
 'Brain Games': 4.4,
 'Business': 4.2,
 'Card': 4.25,
 'Casino': 4.4,
 'Casual': 4.2,
 'Comics': 4.4,
 'Communication': 4.2,
 'Creativity': 4.4,
 'Dating': 4.1,
 'Education': 4.4,
 'Educational': 4.2,
 'Entertainment': 4.2,
 'Events': 4.5,
 'Finance': 4.3,
 'Food & Drink': 4.3,
 'Health & Fitness': 4.5,
 'House & Home': 4.2,
 'Libraries & Demo': 4.2,
 'Lifestyle': 4.2,
 'Maps & Navigation': 4.2,
 'Medical': 4.3,
 'Music': 4.3,
 'Music & Audio': 4.3,
 'Music & Video': 4.2,
 'News & Magazines': 4.2,
 'Parenting': 4.4,
 'Personalization': 4.4,
 'Photography': 4.3,
 'Pretend Play': 4.2,
 'Productivity': 4.3,
 'Puzzle': 4.4,
 'Racing': 4.2,
 'Role Playing': 4.3,
 'Shopping': 4.3,
 'Simulation': 4.2,
 'Social': 4.3,
 'Sports': 4.3,
 'Strategy': 4.3,
 'Tools': 4.2,
 'Travel & Local': 4.2,
 'Trivia': 4.25,
 'Vide

In [1610]:
is_rating_nan = df_cleaned[RATING].isna()
df_cleaned.loc[is_rating_nan, RATING] = df_cleaned.loc[is_rating_nan, GENRE].map(df_genre_median_map)
df_cleaned[RATING].isna().sum()

0
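
The same imputation can also be written without an intermediate dictionary. The sketch below uses `groupby(...).transform("median")` on made-up sample data; it is an alternative formulation, not the code used in this notebook, and the frame `sample` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample: two genres, each with one missing rating
sample = pd.DataFrame({
    "Genres": ["Tools", "Tools", "Tools", "Beauty", "Beauty"],
    "Rating": [4.0, 4.4, np.nan, 4.3, np.nan],
})

# Median rating of each row's genre, aligned with the original index (NaN is skipped)
genre_median = sample.groupby("Genres")["Rating"].transform("median")

# Fill only the missing ratings; existing ratings stay untouched
sample["Rating"] = sample["Rating"].fillna(genre_median)
```

A genre whose ratings are all missing would still yield NaN here, so the final `isna().sum()` check remains useful.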

We were able to eliminate all missing values in the rating feature!

#### Abnormal Ratings

In [1611]:
df_cleaned[df_cleaned[RATING] < 0].shape[0], df_cleaned[df_cleaned[RATING] > 5].shape[0]

(0, 0)

Looks fine!

#### Classification of Rating Class {1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5}

In [1612]:
def get_rating_class(rating):
    rating_class = int(rating * 10 / 5) / 2 + 0.5
    return rating_class if rating_class < 5 and rating_class >= 0.5 else 5 if rating_class > 5 else 0.5

df_cleaned[RATING_CLASS] = df_cleaned[RATING].apply(lambda rating: get_rating_class(rating))
df_cleaned[RATING_CLASS].value_counts().sort_index()

0.5    2520
1.5      20
2.0      34
2.5      73
3.0     153
3.5     445
4.0    1234
4.5    5301
5.0     272
Name: Rating Class, dtype: int64

In [1613]:
df_cleaned[(df_cleaned[RATING] >= 0.5) & (df_cleaned[RATING] < 1)].shape[0]

0
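
The check above finds no ratings between 0.5 and 1, yet 2520 rows carry the class 0.5. A likely cause is the boundary case in `get_rating_class`: for ratings from 4.5 up to (but not including) 5.0 the computed class is exactly 5, which is neither `< 5` nor `> 5`, so the expression falls through to 0.5. A minimal sketch of a variant that clamps the value instead (`get_rating_class_clamped` is a hypothetical name, not part of the notebook):

```python
def get_rating_class_clamped(rating):
    """Map a rating to the upper edge of its 0.5-wide bin, clamped to [0.5, 5]."""
    # Same binning as get_rating_class above
    rating_class = int(rating * 10 / 5) / 2 + 0.5
    # Clamp instead of branching, so a computed class of exactly 5 stays 5
    return min(max(rating_class, 0.5), 5.0)


example_classes = [get_rating_class_clamped(r) for r in (1.0, 3.9, 4.1, 4.5, 4.7, 5.0)]
```

With this variant, the rows currently labeled 0.5 would presumably land in class 5.0. The outputs shown in this notebook were produced with the original function.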

### Reviews

In [1614]:
df_cleaned[REVIEWS].describe()

count     10052
unique     5330
top           0
freq        596
Name: Reviews, dtype: int64

#### Missing values

In [1615]:
df_cleaned[REVIEWS].isna().sum()

0

#### Classification of reviews to Review Rate {High, Medium, Low}

In [1616]:
q1, q2 = df_cleaned[REVIEWS].quantile(q=[1/3, 2/3])
df_cleaned[REVIEW_RATE] = df_cleaned[REVIEWS].apply(lambda review_count: "Low" if review_count <= q1 else "Medium" if review_count <= q2 else "High")
df_cleaned[REVIEW_RATE].value_counts()

High      3351
Low       3351
Medium    3350
Name: Review Rate, dtype: int64
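
The nested conditional in the lambda can also be expressed with `pd.cut`, which assigns each value to a labeled interval. The sketch below uses made-up review counts and the same tercile thresholds; with the default `right=True`, the intervals are closed on the right, which matches the `<=` comparisons above. It assumes the two thresholds are distinct, since `pd.cut` rejects duplicate bin edges.

```python
import numpy as np
import pandas as pd

# Made-up review counts, for illustration only
reviews = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8])

q1, q2 = reviews.quantile(q=[1/3, 2/3])

# Intervals: (-inf, q1] -> Low, (q1, q2] -> Medium, (q2, inf) -> High
review_rate = pd.cut(
    reviews,
    bins=[-np.inf, q1, q2, np.inf],
    labels=["Low", "Medium", "High"],
)
```

The result is an ordered categorical, so sorting and plotting follow the order Low, Medium, High rather than the alphabetical order. The same pattern would apply to the size classes.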

### Size

In [1617]:
df_cleaned[SIZE].describe()

count     8781.0
unique     191.0
top         12.0
freq       185.0
Name: Size, dtype: float64

#### Missing values

In [1618]:
is_size_na = df_cleaned[SIZE].isna()
is_size_na.sum()

1271

We impute the missing sizes with the median size of the app's genre.

In [1619]:
genre_median_size_map = df_cleaned.pivot_table(index=GENRE, values=SIZE, aggfunc=MEDIAN).to_dict()[SIZE]
genre_median_size_map

{'Action': 45.0,
 'Action & Adventure': 39.0,
 'Adventure': 31.0,
 'Arcade': 36.0,
 'Art & Design': 8.95,
 'Auto & Vehicles': 16.0,
 'Beauty': 9.2,
 'Board': 17.0,
 'Books & Reference': 7.8,
 'Brain Games': 20.5,
 'Business': 8.6,
 'Card': 23.0,
 'Casino': 26.0,
 'Casual': 26.0,
 'Comics': 10.0,
 'Communication': 5.7,
 'Creativity': 28.0,
 'Dating': 11.0,
 'Education': 12.0,
 'Educational': 46.0,
 'Entertainment': 8.2,
 'Events': 9.7,
 'Finance': 12.0,
 'Food & Drink': 17.0,
 'Health & Fitness': 11.5,
 'House & Home': 9.45,
 'Libraries & Demo': 3.1,
 'Lifestyle': 9.6,
 'Maps & Navigation': 9.8,
 'Medical': 15.0,
 'Music': 36.0,
 'Music & Audio': 9.8,
 'Music & Video': 28.0,
 'News & Magazines': 9.0,
 'Parenting': 11.0,
 'Personalization': 7.1,
 'Photography': 9.649999999999999,
 'Pretend Play': 48.5,
 'Productivity': 6.9,
 'Puzzle': 29.0,
 'Racing': 45.0,
 'Role Playing': 47.0,
 'Shopping': 12.0,
 'Simulation': 41.0,
 'Social': 7.9,
 'Sports': 20.0,
 'Strategy': 45.5,
 'Tools': 4.2,
 '

In [1620]:
df_cleaned.loc[is_size_na, SIZE] = df_cleaned.loc[is_size_na, GENRE].map(genre_median_size_map)
df_cleaned[SIZE].isna().sum()

0

#### Classification of size to Size Class {Low, Medium, High}

In [1621]:
q1, q2 = df_cleaned[SIZE].quantile(q=[1/3, 2/3])
df_cleaned[SIZE_CLASS] = df_cleaned[SIZE].apply(lambda size: "Low" if size <= q1 else "Medium" if size <= q2 else "High")
df_cleaned[SIZE_CLASS].value_counts()

Medium    3435
Low       3363
High      3254
Name: Size Class, dtype: int64

In [1622]:
df_cleaned[APP_NAME].describe()

count                              10052
unique                              9659
top       Kids Learn Languages by Mondly
freq                                   2
Name: App, dtype: object

The size classes are not equally populated because we duplicated some rows, so the size column contains repeated values, and equal values always fall into the same class. <br>
That is fine: because the genre differs, each duplicate is a distinct record, and we can still remove the duplicates via the app name column if needed.

### Downloads

In [1623]:
df_cleaned[DOWNLOADS].value_counts()

1,000,000+        1519
100,000+          1166
10,000+           1061
10,000,000+        999
1,000+             906
100+               716
5,000,000+         649
500,000+           538
50,000+            487
5,000+             477
10+                387
500+               332
50,000,000+        208
50+                206
100,000,000+       192
5+                  82
1+                  68
500,000,000+        24
1,000,000,000+      20
0+                  14
0                    1
Name: Downloads, dtype: int64

In [1624]:
# Strip "+", "," and whitespace in one pass; inside a character class "+" is a literal, not a regex quantifier
df_cleaned[DOWNLOADS] = df_cleaned[DOWNLOADS].str.replace(r"[+,\s]", "", regex=True).astype("int")
df_cleaned[DOWNLOADS].value_counts().sort_index()

0               15
1               68
5               82
10             387
50             206
100            716
500            332
1000           906
5000           477
10000         1061
50000          487
100000        1166
500000         538
1000000       1519
5000000        649
10000000       999
50000000       208
100000000      192
500000000       24
1000000000      20
Name: Downloads, dtype: int64

#### Classification of downloads to {Low, Medium, High, Top}

In [1625]:
q1, q2, q3, qtop10 = df_cleaned[DOWNLOADS].quantile(q=[1/4, 2/4, 3/4, 9/10])

df_cleaned[DOWNLOAD_RATE] = df_cleaned[DOWNLOADS].apply(lambda downloads: "Top" if downloads >= qtop10 else "High" if downloads >= q3 else "Medium" if downloads >= q2 else "Low")
df_cleaned[DOWNLOAD_RATE].value_counts()

Low       4737
High      2168
Medium    1704
Top       1443
Name: Download Rate, dtype: int64
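
Note that `q1` is computed but never used in the classification, so "Low" covers everything below the median. The counts above are consistent with thresholds of 100,000 (Medium), 1,000,000 (High) and 10,000,000 (Top). As a sketch, the same ordered rules can be written with `np.select`, which picks the first matching condition per row; the sample values below are made up.

```python
import numpy as np
import pandas as pd

# Made-up download counts, for illustration only
downloads = pd.Series([0, 10, 100, 1_000, 10_000, 100_000,
                       1_000_000, 10_000_000, 100_000_000, 1_000_000_000])

q2, q3, qtop10 = downloads.quantile(q=[2/4, 3/4, 9/10])

# Conditions are checked in order; the first one that holds determines the label
conditions = [downloads >= qtop10, downloads >= q3, downloads >= q2]
choices = ["Top", "High", "Medium"]

download_rate = pd.Series(
    np.select(conditions, choices, default="Low"),
    index=downloads.index,
)
```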

### Type

In [1626]:
df_cleaned[TYPE].describe()

count     10052
unique        2
top        Free
freq       9226
Name: Type, dtype: object

In [1627]:
df_cleaned[TYPE].isna().sum()

0

In [1628]:
df_cleaned[TYPE].value_counts()

Free    9226
Paid     826
Name: Type, dtype: int64

Looks fine!

### Price

In [1629]:
df_cleaned[PRICE].describe()

count     10052
unique       92
top           0
freq       9226
Name: Price, dtype: object

In [1630]:
df_cleaned[PRICE].isna().sum()

0

In [1631]:
df_cleaned[PRICE].value_counts()

0          9226
$2.99       151
$0.99       150
$1.99        80
$4.99        79
           ... 
$389.99       1
$19.90        1
$1.75         1
$14.00        1
$1.04         1
Name: Price, Length: 92, dtype: int64

In [1632]:
# regex=False: "$" must be removed literally, not interpreted as the end-of-string anchor
df_cleaned[PRICE] = df_cleaned[PRICE].str.replace("$", "", regex=False).astype("float64")
df_cleaned[PRICE].describe()

count    10052.000000
mean         1.079390
std         16.521963
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        400.000000
Name: Price, dtype: float64

In [1633]:
is_price_over_0 = df_cleaned[PRICE] > 0
df_cleaned.loc[is_price_over_0, PRICE].quantile(q=[1/3, 2/3, 9/10, 99/100])


0.333333      1.99
0.666667      3.99
0.900000      9.99
0.990000    399.99
Name: Price, dtype: float64

In [1634]:
q1, q2, qtop10 = df_cleaned.loc[is_price_over_0, PRICE].quantile(q=[1/3, 2/3, 9/10])

df_cleaned[PRICE_CLASS] = df_cleaned[PRICE].apply(lambda price: "Very High" if price >= qtop10 else "Low" if price <= q1 else "Medium" if price <= q2 else "High")
df_cleaned.loc[is_price_over_0, PRICE_CLASS].value_counts()

Low          294
Medium       273
High         164
Very High     95
Name: Price Class, dtype: int64

### Content Rating

In [1635]:
df_cleaned[CONTENT_GROUP].isna().sum()

0

In [1636]:
df_cleaned[CONTENT_GROUP].value_counts()

Everyone           8272
Teen               1036
Mature 17+          393
Everyone 10+        346
Adults only 18+       3
Unrated               2
Name: Content Group, dtype: int64

- Mature 17+ -> Adults
- Adults only 18+ -> Adults
- Everyone 10+ -> Everyone
- Unrated -> Everyone

In [1637]:
content_rating_map = {"Mature 17+": "Adults", "Adults only 18+": "Adults", "Everyone 10+": "Everyone", "Unrated": "Everyone"}
df_cleaned[CONTENT_GROUP] = df_cleaned[CONTENT_GROUP].replace(content_rating_map)
df_cleaned[CONTENT_GROUP].value_counts()

Everyone    8620
Teen        1036
Adults       396
Name: Content Group, dtype: int64

### Clean Dataset

In [1638]:
df_cleaned

,App,Rating,Reviews,Size,Downloads,Type,Price,Content Group,Genres,Rating Class,Review Rate,Size Class,Download Rate,Price Class
0,Photo Editor & Candy Camera & Grid & ScrapBook,4.1,159,19.0,10000,Free,0.0,Everyone,Art & Design,4.5,Medium,Medium,Low,Low
1,Coloring book moana,3.9,967,14.0,500000,Free,0.0,Everyone,Art & Design,4.0,Medium,Medium,Medium,Low
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",4.7,87510,8.7,5000000,Free,0.0,Everyone,Art & Design,0.5,High,Medium,High,Low
3,Sketch - Draw & Paint,4.5,215644,25.0,50000000,Free,0.0,Teen,Art & Design,0.5,High,High,Top,Low
4,Pixel Draw - Number Art Coloring Book,4.3,967,2.8,100000,Free,0.0,Everyone,Art & Design,4.5,Medium,Low,Medium,Low
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10047,PBS KIDS Games,4.3,12919,94.0,1000000,Free,0.0,Everyone,Education,4.5,High,High,High,Low
10048,Dolphin and fish coloring book,3.9,2249,28.0,500000,Free,0.0,Everyone,Creativity,4.0,Medium,High,Medium,Low
10049,Cake Shop - Kids Cooking,4.3,30668,33.0,5000000,Free,0.0,Everyone,Pretend Play,4.5,High,High,High,Low
10050,Hair saloon - Spa salon,4.2,38473,23.0,10000000,Free,0.0,Everyone,Pretend Play,4.5,High,High,Top,Low


## Data visualization & interpretation