In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Part I - Basic Data Exploration

Load the dataset

In [2]:
path_to_dataset = "googleplaystore.csv"
google_play_store_df = pd.read_csv(path_to_dataset)

Display the first 10 samples using the `.head()` method.

In [3]:
google_play_store_df.head(n=10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up


In EDA projects, I always like to list the column names with the `.columns` attribute rather than reading them off the table above.

In [4]:
google_play_store_df.columns.to_list()

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [5]:
google_play_store_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [6]:
google_play_store_df.describe()

Unnamed: 0,Rating
count,9367.0
mean,4.193338
std,0.537431
min,1.0
25%,4.0
50%,4.3
75%,4.5
max,19.0


In [7]:
num_samples = len(google_play_store_df)
print(f'There are {num_samples} samples in the dataset.')

There are 10841 samples in the dataset.


# Part II - Missing Values and Data Types

Inspecting the results of `google_play_store_df.head(n=10)` and `google_play_store_df.info()`, I observed the following:

- The first thing that caught my eye is that only one feature, `Rating`, is stored as `float64`.

- Every other feature is stored as `object`. I know that pandas stores `str` data as `object` in DataFrames. However, even though `Price` is numerical in real life, it is stored as `object`. There are other quantitative features as well, such as `Reviews`, `Size` and `Installs`. Thanks to the flexibility of Python, I can convert those to numerical types if needed (e.g., I can define a function that maps "19M" to 19,000,000).

- There are 10841 rows in total in the dataset. However, there are some missing values, according to the output of the `.info()` method call. For example, there are 10841 - 9367 = 1474 missing values in the `Rating` column.
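
The size conversion mentioned in the list above could look roughly like the sketch below. The function name `parse_size` is hypothetical, and the sketch assumes that "M" means 10**6 and "k" means 10**3; the dataset itself does not say whether decimal or binary units are intended. Values without a recognised suffix, such as "Varies with device", are mapped to NaN.

```python
import math


def parse_size(size_text):
    """Convert a size string such as "19M" or "11k" into a plain number.

    Assumption: "M" = 10**6 and "k" = 10**3 (decimal units).
    Anything that does not end in a known suffix becomes NaN.
    """
    multipliers = {"k": 1_000, "M": 1_000_000}
    text = str(size_text).strip()
    suffix = text[-1:]  # empty string if text is empty
    if suffix not in multipliers:
        return math.nan
    try:
        return float(text[:-1]) * multipliers[suffix]
    except ValueError:
        # The part before the suffix was not a number.
        return math.nan


print(parse_size("19M"))
print(parse_size("11k"))
print(parse_size("Varies with device"))
```

Applied to the real data, this would be something like `google_play_store_df["Size"].map(parse_size)`, which leaves the original column untouched and returns a new numeric Series.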

## 2.1. Check missing values

In [8]:
# I count missing values in each column of dataset, and store them in a dictionary
# keys: column names, corresponding values: count of missing values in that column
missing_value_count_dict = {}

for col in google_play_store_df.columns:
  missing_value_count_dict[col] = int(google_play_store_df[col].isna().sum())

missing_value_count_dict

{'App': 0,
 'Category': 0,
 'Rating': 1474,
 'Reviews': 0,
 'Size': 0,
 'Installs': 0,
 'Type': 1,
 'Price': 0,
 'Content Rating': 1,
 'Genres': 0,
 'Last Updated': 0,
 'Current Ver': 8,
 'Android Ver': 3}

There are many missing values in the `Rating` column. Let's find the percentage.

In [9]:
missing_rating_percentage = google_play_store_df["Rating"].isna().sum() / len(google_play_store_df) * 100
missing_rating_percentage = round(float(missing_rating_percentage), 2)
missing_rating_percentage

13.6
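
The same percentage can be computed for every column at once, because the mean of a boolean mask is the fraction of `True` values. The sketch below uses a small toy frame (`toy_df`, with made-up values) in place of `google_play_store_df`, so that it runs on its own.

```python
import pandas as pd

# Toy frame standing in for google_play_store_df (values are illustrative only).
toy_df = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, None, 4.7, None],
    "Type": ["Free", "Free", None, "Paid"],
})

# isna() yields a boolean frame; mean() per column is the share of missing values.
missing_percentages = (toy_df.isna().mean() * 100).round(2)
print(missing_percentages.to_dict())
```

On the real dataset, `google_play_store_df.isna().mean() * 100` would give the per-column percentages in one call.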

Let's also take a look at the other columns with missing values.

In [10]:
# List of column names with missing values
keys_missing_values = [key for key in missing_value_count_dict.keys() if missing_value_count_dict[key] != 0]

# Remove "Rating" from the list
keys_missing_values.remove("Rating")

keys_missing_values

['Type', 'Content Rating', 'Current Ver', 'Android Ver']

In [11]:
google_play_store_df[google_play_store_df["Type"].isna()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
9148,Command & Conquer: Rivals,FAMILY,,0,Varies with device,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device,Varies with device


In [12]:
google_play_store_df[google_play_store_df["Content Rating"].isna()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


In [13]:
google_play_store_df[google_play_store_df["Current Ver"].isna()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
15,Learn To Draw Kawaii Characters,ART_AND_DESIGN,3.2,55,2.7M,"5,000+",Free,0,Everyone,Art & Design,"June 6, 2018",,4.2 and up
1553,Market Update Helper,LIBRARIES_AND_DEMO,4.1,20145,11k,"1,000,000+",Free,0,Everyone,Libraries & Demo,"February 12, 2013",,1.5 and up
6322,Virtual DJ Sound Mixer,TOOLS,4.2,4010,8.7M,"500,000+",Free,0,Everyone,Tools,"May 10, 2017",,4.0 and up
6803,BT Master,FAMILY,,0,222k,100+,Free,0,Everyone,Education,"November 6, 2016",,1.6 and up
7333,Dots puzzle,FAMILY,4.0,179,14M,"50,000+",Paid,$0.99,Everyone,Puzzle,"April 18, 2018",,4.0 and up
7407,Calculate My IQ,FAMILY,,44,7.2M,"10,000+",Free,0,Everyone,Entertainment,"April 3, 2017",,2.3 and up
7730,UFO-CQ,TOOLS,,1,237k,10+,Paid,$0.99,Everyone,Tools,"July 4, 2016",,2.0 and up
10342,La Fe de Jesus,BOOKS_AND_REFERENCE,,8,658k,"1,000+",Free,0,Everyone,Books & Reference,"January 31, 2017",,3.0 and up


In [14]:
google_play_store_df[google_play_store_df["Android Ver"].isna()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
4453,[substratum] Vacuum: P,PERSONALIZATION,4.4,230,11M,"1,000+",Paid,$1.49,Everyone,Personalization,"July 20, 2018",4.4,
4490,Pi Dark [substratum],PERSONALIZATION,4.5,189,2.1M,"10,000+",Free,0,Everyone,Personalization,"March 27, 2018",1.1,
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


In [15]:
# Display rows with any missing values
google_play_store_df[google_play_store_df.isna().any(axis=1)]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
15,Learn To Draw Kawaii Characters,ART_AND_DESIGN,3.2,55,2.7M,"5,000+",Free,0,Everyone,Art & Design,"June 6, 2018",,4.2 and up
23,Mcqueen Coloring pages,ART_AND_DESIGN,,61,7.0M,"100,000+",Free,0,Everyone,Art & Design;Action & Adventure,"March 7, 2018",1.0.0,4.1 and up
113,Wrinkles and rejuvenation,BEAUTY,,182,5.7M,"100,000+",Free,0,Everyone 10+,Beauty,"September 20, 2017",8.0,3.0 and up
123,Manicure - nail design,BEAUTY,,119,3.7M,"50,000+",Free,0,Everyone,Beauty,"July 23, 2018",1.3,4.1 and up
126,Skin Care and Natural Beauty,BEAUTY,,654,7.4M,"100,000+",Free,0,Teen,Beauty,"July 17, 2018",1.15,4.1 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10824,Cardio-FR,MEDICAL,,67,82M,"10,000+",Free,0,Everyone,Medical,"July 31, 2018",2.2.2,4.4 and up
10825,Naruto & Boruto FR,SOCIAL,,7,7.7M,100+,Free,0,Teen,Social,"February 2, 2018",1.0,4.0 and up
10831,payermonstationnement.fr,MAPS_AND_NAVIGATION,,38,9.8M,"5,000+",Free,0,Everyone,Maps & Navigation,"June 13, 2018",2.0.148.0,4.0 and up
10835,FR Forms,BUSINESS,,0,9.6M,10+,Free,0,Everyone,Business,"September 29, 2016",1.1.5,4.0 and up


## 2.2. Identify data types

In [16]:
google_play_store_df.dtypes

Unnamed: 0,0
App,object
Category,object
Rating,float64
Reviews,object
Size,object
Installs,object
Type,object
Price,object
Content Rating,object
Genres,object


## 2.3. Check if each column has an appropriate data type.

In [17]:
google_play_store_df.head(1)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up


- The first thing that caught my eye is that only one feature, `Rating`, is stored as `float64`.

- Every other feature is stored as `object`. I know that pandas stores `str` data as `object` in DataFrames. However, even though `Price` is numerical in real life, it is stored as `object`. There are other quantitative features as well, such as `Reviews`, `Size` and `Installs`. Thanks to the flexibility of Python, I can convert those to numerical types if needed (e.g., I can define a function that maps "19M" to 19,000,000).
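
One way to check how far each `object` column is from being numeric is to try `pd.to_numeric` with `errors="coerce"` and measure how many entries survive. The sketch below is only an illustration: `sample` is a toy frame built from a few values visible in the tables above, not the full dataset.

```python
import pandas as pd

# Toy sample using values that appear in the outputs above.
sample = pd.DataFrame({
    "Reviews": ["159", "967", "87510"],
    "Price": ["0", "$0.99", "0"],
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
})

# Share of entries per column that parse as numbers without any cleaning.
parse_rate = sample.apply(
    lambda col: pd.to_numeric(col, errors="coerce").notna().mean()
)
print(parse_rate.to_dict())
```

A rate of 1.0 suggests the column can be converted directly, while a lower rate points to symbols (such as `$`, `,` or `+`) that need to be stripped first.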

## 2.4. Inspect categorical vs numerical columns

In [18]:
# Create list of numerical and categorical columns
numerical_columns = google_play_store_df.select_dtypes(include="number").columns.to_list()
categorical_columns = google_play_store_df.select_dtypes(exclude="number").columns.to_list()

In [19]:
numerical_columns

['Rating']

In [20]:
categorical_columns

['App',
 'Category',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [21]:
google_play_store_df[numerical_columns].head(3)

Unnamed: 0,Rating
0,4.1
1,3.9
2,4.7


In [22]:
google_play_store_df[categorical_columns].head(3)

Unnamed: 0,App,Category,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up


## 2.5. Verify which columns are suitable for numerical analysis or visualization

## 2.6. Handle incorrect datatypes
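
A minimal sketch of one possible conversion for `Price`, `Installs` and `Reviews` is shown below. It assumes that the only non-numeric characters are `$` in `Price` and `,` or `+` in `Installs`, which holds for the rows displayed above but has not been verified for the whole dataset. The `sample` frame is a toy stand-in for `google_play_store_df`.

```python
import pandas as pd

# Toy sample using values that appear in the outputs above.
sample = pd.DataFrame({
    "Price": ["0", "$0.99", "$1.49"],
    "Installs": ["10,000+", "100+", "0"],
    "Reviews": ["159", "0", "230"],
})

cleaned = pd.DataFrame({
    # Strip the dollar sign, then parse; unparseable values become NaN.
    "Price": pd.to_numeric(
        sample["Price"].str.replace("$", "", regex=False), errors="coerce"
    ),
    # Strip thousands separators and the trailing plus sign.
    "Installs": pd.to_numeric(
        sample["Installs"].str.replace(r"[,+]", "", regex=True), errors="coerce"
    ),
    # Reviews should already be digit strings; coerce guards against stray text.
    "Reviews": pd.to_numeric(sample["Reviews"], errors="coerce"),
})
print(cleaned)
```

Using `errors="coerce"` turns any unexpected value into NaN instead of raising, so malformed rows show up as missing values that can be inspected afterwards.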

## 2.7. Discuss your findings:

- Is the column categorical or numerical?
- Are there missing or zero values that might need cleaning?
- Does the column need conversion before analysis?

## 2.8. Handle missing values
Decide whether to fill missing values (e.g., with median or mode), or drop rows/columns with too many missing entries.
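
The options named above can be sketched as follows. This is an illustration on a toy frame (`toy_df`, with made-up values), not a decision about which strategy suits the real dataset: the numeric column is filled with its median, the categorical column with its mode, and a separate variant drops the incomplete rows instead.

```python
import pandas as pd

# Toy frame standing in for google_play_store_df (values are illustrative only).
toy_df = pd.DataFrame({
    "Rating": [4.1, None, 4.7, 3.9],
    "Type": ["Free", None, "Free", "Paid"],
})

# Option 1: fill numeric gaps with the median and categorical gaps with the mode.
filled = toy_df.copy()
filled["Rating"] = filled["Rating"].fillna(filled["Rating"].median())
filled["Type"] = filled["Type"].fillna(filled["Type"].mode().iloc[0])

# Option 2: drop the rows where a given column is missing.
dropped = toy_df.dropna(subset=["Type"])

print(filled)
print(len(dropped))
```

Filling keeps every row but changes the distribution of the filled column, whereas dropping keeps the values honest at the cost of losing rows; with about 13.6% of `Rating` missing, that trade-off is worth weighing explicitly.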

# Part III - Descriptive Statistics

# Part IV - Research Queries and Analytical Insights

------