# **Google Play Store Complete EDA**

---

**Author Name:** Tayyab Riaz\
**Date:** 16-10-2023\
**Email:** m.tayyab.riaz@outlook.com\
**Data Reference:** [Kaggle](https://www.kaggle.com/datasets/lava18/google-play-store-apps/)

---

# **About Data Set**
---

>- **`Context`**

While many public datasets (on Kaggle and the like) provide Apple App Store data, few counterpart datasets are available for Google Play Store apps. On digging deeper, I found that the iTunes App Store page uses a nicely indexed, appendix-like structure that makes web scraping simple. The Google Play Store, on the other hand, relies on modern techniques such as dynamic page loading with jQuery, which makes scraping more challenging.

>- **`Content`**

Each app (row) has values for category, rating, size, and more.

>- **`Acknowledgements`**

This information is scraped from the Google Play Store. This app information would not be available without it.

>- **`Inspiration`**

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

---

&nbsp;\
&nbsp;\
&nbsp;\
&nbsp;\
&nbsp;

---
##### **`1) Importing Libraries`**
---

In [379]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

&nbsp;\
&nbsp;\
&nbsp;

---
##### **`2) Some Important Things`**
---

&nbsp;

> **Note**: By default, the notebook truncates long outputs. The following commands raise the display limits so that all rows and columns are shown.
- To show all rows and columns of data in the output

In [380]:
pd.set_option('display.max_columns',None)   # For showing all columns 
pd.set_option('display.max_rows',None)   # For showing all rows 
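
If lifting the limits globally makes later outputs too long, they can also be raised for a single block only. The sketch below uses a small hypothetical frame named `demo` (not part of the dataset) and `pd.option_context`, which restores the previous display options when the block exits.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
demo = pd.DataFrame({"a": range(100), "b": range(100)})

# The display limits are lifted only inside the `with` block
with pd.option_context("display.max_rows", None, "display.max_columns", None):
    full_text = repr(demo)

# Outside the block, the previous limits apply again
short_text = repr(demo)

print(len(full_text.splitlines()), len(short_text.splitlines()))
```

This keeps the rest of the notebook compact while still allowing a full view where it is needed.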

&nbsp;

- Ignoring warnings in the notebook

In [381]:
import warnings
warnings.filterwarnings('ignore')

&nbsp;

---
##### **`3) Loading Dataset | Exploration | Cleaning`**
---

 ↪ Load the CSV file with pandas
 
 ↪ Create the DataFrame and understand the data in the dataset using pandas
 
 ↪ Deal with missing data, outliers, and incorrect records

In [382]:
df=pd.read_csv('googleplaystore.csv')

&nbsp;

- Looking at the first 5 rows

In [383]:
df.head(5)

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


&nbsp;

- Names of all columns in the data

In [384]:
print('\nAll names of columns in data are') 
print('-'*70)
print(df.columns) 
print('-'*70)


All names of columns in data are
----------------------------------------------------------------------
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')
----------------------------------------------------------------------


&nbsp;

- Data shape, i.e. total rows and columns as (rows, columns)

In [385]:
print('\n')
print(df.shape)
print('\n') 
print('-'*70)  
print(f'Total number of rows in data : {df.shape[0]} rows')   
print(f'Total number of columns in data : {df.shape[1]} columns') 
print('-'*70)   
  



(10841, 13)


----------------------------------------------------------------------
Total number of rows in data : 10841 rows
Total number of columns in data : 13 columns
----------------------------------------------------------------------


&nbsp;

- Full information about the data: let's look at the columns and their data types using the detailed `info` function.

In [386]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  int64  
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10841 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10839 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 1.1+ MB


&nbsp;

---
# **Observations**
---
1. There are 10841 rows and 13 columns in the dataset.
2. The columns have different data types: one `float64` (`Rating`), one `int64` (`Reviews`), and eleven `object` columns.
3. The columns in the dataset are:

 - `App` 
 - `Category`
 - `Rating`
 - `Reviews`
 - `Size`
 - `Installs`
 - `Type`
 - `Price`
 - `Content Rating`
 - `Genres`
 - `Last Updated`
 - `Current Ver`
 - `Android Ver`
4. There are some missing values in the dataset, which we will examine in detail and deal with later in the notebook.
5. Some columns are of object data type but should be numeric; we will convert them later in the notebook.
 - **`Size`**
 - **`Installs`**
 - **`Price`** 
---
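The missing values noted in observation 4 can be summarized per column. This is a minimal sketch on a hypothetical three-column `sample` frame; in the notebook the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the notebook itself would use `df`
sample = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# Count nulls per column and express them as a share of all rows
missing = sample.isnull().sum()
missing_pct = (missing / len(sample) * 100).round(2)

# Keep only the columns that actually have missing values
summary = pd.DataFrame({"missing": missing, "percent": missing_pct})
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```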

&nbsp;


#### **1) Size column**

&nbsp;

- Checking for any null values in Size column

In [387]:
print('Total numbers of nulls values in size column are: ',df['Size'].isnull().sum())

Total numbers of nulls values in size column are:  0


&nbsp;

- Checking Unique Values

In [388]:
print('\n') 
print('-'*70)  
print('Total number of unique values in size column is : ', df['Size'].nunique())
print('-'*70)
print('\nUnique values in size column are')
print('-'*70)
print(df['Size'].unique())



----------------------------------------------------------------------
Total number of unique values in size column is :  461
----------------------------------------------------------------------

Unique values in size column are
----------------------------------------------------------------------
['19M' '14M' '8.7M' '25M' '2.8M' '5.6M' '29M' '33M' '3.1M' '28M' '12M'
 '20M' '21M' '37M' '2.7M' '5.5M' '17M' '39M' '31M' '4.2M' '7.0M' '23M'
 '6.0M' '6.1M' '4.6M' '9.2M' '5.2M' '11M' '24M' 'Varies with device'
 '9.4M' '15M' '10M' '1.2M' '26M' '8.0M' '7.9M' '56M' '57M' '35M' '54M'
 '201k' '3.6M' '5.7M' '8.6M' '2.4M' '27M' '2.5M' '16M' '3.4M' '8.9M'
 '3.9M' '2.9M' '38M' '32M' '5.4M' '18M' '1.1M' '2.2M' '4.5M' '9.8M' '52M'
 '9.0M' '6.7M' '30M' '2.6M' '7.1M' '3.7M' '22M' '7.4M' '6.4M' '3.2M'
 '8.2M' '9.9M' '4.9M' '9.5M' '5.0M' '5.9M' '13M' '73M' '6.8M' '3.5M'
 '4.0M' '2.3M' '7.2M' '2.1M' '42M' '7.3M' '9.1M' '55M' '23k' '6.5M' '1.5M'
 '7.5M' '51M' '41M' '48M' '8.5M' '46M' '8.3M' '4.3M' '4.7M

&nbsp;

---
# **Observation**
---
The `Size` column contains three types of values:
- `Varies with device`
- (int/float)M
- (int/float)k\
&nbsp;
>- **`Important Note:`** The `Size` column has many unique values in mixed units. We first convert the `M` and `k` values to one common unit (bytes), which removes the `M` and `k` suffixes, and then store the result as a numeric data type.
---
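As an alternative to a row-by-row function, the unit handling described in the note can be sketched with vectorized string methods. The `sizes` series below is a hypothetical sample, and the sketch goes straight to megabytes, assuming 1 M = 1024 k.

```python
import pandas as pd

# Hypothetical sample covering the three kinds of values
sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])

# Split each value into its numeric part and its unit suffix;
# values that do not match (e.g. 'Varies with device') become NaN
parts = sizes.str.extract(r"^(?P<number>[\d.]+)(?P<unit>[kM])$")
number = pd.to_numeric(parts["number"])

# Megabytes per unit
factor = parts["unit"].map({"k": 1 / 1024, "M": 1.0})

size_in_mb = number * factor
print(size_in_mb)
```

Values without a recognized unit stay `NaN`, which keeps them out of later numeric summaries.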

&nbsp;

- Verifying the total counts of these 3 different types

In [389]:
print('\n')
print('-'*70)
print('Total number of values that have \'M\':',df['Size'].loc[df['Size'].str.contains('M')].value_counts().sum())
print('Total number of values that have \'k\':',df['Size'].loc[df['Size'].str.contains('k')].value_counts().sum())
print('Total number of values that have \'Varies with device\':',df['Size'].loc[df['Size'].str.contains('Varies with device')].value_counts().sum())
print('-'*70)



----------------------------------------------------------------------
Total number of values that have 'M': 8830
Total number of values that have 'k': 316
Total number of values that have 'Varies with device': 1695
----------------------------------------------------------------------


&nbsp;

- Convert all sizes into bytes in Size column

In [390]:
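def convert_size(size):
    # Convert a size string such as '19M' or '201k' into bytes
    if isinstance(size, str):
        if 'k' in size:
            return float(size.replace('k', '')) * 1024          # kilobytes to bytes
        elif 'M' in size:
            return float(size.replace('M', '')) * 1024 * 1024   # megabytes to bytes
        elif 'Varies with device' in size:
            return np.nan                                        # size is unknown
    return size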
   

In [391]:
df['Size']=df['Size'].apply(convert_size)

&nbsp;

- Convert all sizes from bytes into megabytes (MB) in Size column

In [392]:
df['Size']=df['Size'].apply(lambda x: x /(1024*1024))

&nbsp;

- Renaming size column

In [393]:
df.rename(columns={'Size':'Size_in_mb'}, inplace=True)

&nbsp;

#### **2) Installs column**

&nbsp;

- Checking for any null values in Installs column

In [394]:
print('Total numbers of nulls values in Installs column are: ',df['Installs'].isnull().sum())

Total numbers of nulls values in Installs column are:  0


In [395]:
print('\n') 
print('-'*70)  
print('Total number of unique values in Installs column is : ', df['Installs'].nunique())
print('-'*70)
print('\nUnique values in Installs column are')
print('-'*70)
print(df['Installs'].unique())



----------------------------------------------------------------------
Total number of unique values in Installs column is :  21
----------------------------------------------------------------------

Unique values in Installs column are
----------------------------------------------------------------------
['10,000+' '500,000+' '5,000,000+' '50,000,000+' '100,000+' '50,000+'
 '1,000,000+' '10,000,000+' '5,000+' '100,000,000+' '1,000,000,000+'
 '1,000+' '500,000,000+' '50+' '100+' '500+' '10+' '1+' '5+' '0+' '0']


&nbsp;

---
# **Observation**

---

- The `Installs` column has `10841` values and no null values.
- The values contain a `+` suffix and `,` thousands separators. Only one value, `0`, has no plus sign.

- Let's remove the `+` and `,` characters from the values and convert the column to a numeric data type.

---
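The cleaning described above can also be sketched in vectorized form. The `installs` series is a hypothetical sample; a regular-expression character class removes both characters in one pass.

```python
import pandas as pd

# Hypothetical sample of raw install counts
installs = pd.Series(["10,000+", "1,000,000,000+", "5+", "0"])

# Remove every '+' and ',' and parse the remainder as integers
cleaned = pd.to_numeric(installs.str.replace(r"[+,]", "", regex=True))
print(cleaned.tolist())
```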

&nbsp;

- Verifying how many values contain `+` and `,`

In [396]:
print('\n')
print('-'*70)
print('Total number of values that have \'+\':',df['Installs'].loc[df['Installs'].str.contains('+', regex=False)].value_counts().sum())
print('Total number of values that have \',\':',df['Installs'].loc[df['Installs'].str.contains(',', regex=False)].value_counts().sum())
print('-'*70)



----------------------------------------------------------------------
Total number of values that have '+': 10840
Total number of values that have ',': 9037
----------------------------------------------------------------------


&nbsp;

- Removing `,` and `+`

In [406]:
def removing_unwanted(value):
    # Strip both the '+' suffix and the ',' separators in a single call
    if isinstance(value, str):
        return value.replace('+', '').replace(',', '')
    return value
   

In [407]:
df['Installs']=df['Installs'].apply(removing_unwanted)

&nbsp;

- Converting the Installs column into a numeric type

In [408]:
df['Installs']=df['Installs'].apply(lambda x:int(x))

&nbsp;

- Making a new column called **`Installs_category`** which will have the category of the installs.

In [None]:

bins = [-1, 0, 10, 1000, 10000, 100000, 1000000, 10000000, 10000000000]
labels = ['No Installs','very Few Installs','Low Installs','Moderate Installs','low-Moderate Installs','Moderate-High Installs','High Installs','Very High Installs']
df['Installs_category']=pd.cut(df['Installs'],bins=bins,labels=labels)
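
To see how `pd.cut` assigns values to bins, the sketch below uses a shortened set of bin edges and hypothetical labels. By default each bin includes its right edge, so `10` falls in the `(0, 10]` bin and `11` in the next one; the lower edge `-1` lets `0` land in the first bin.

```python
import pandas as pd

# Hypothetical install counts placed on and around the bin edges
installs = pd.Series([0, 1, 10, 11, 1000, 1001])

bins = [-1, 0, 10, 1000, 10000]
labels = ["none", "up to 10", "11 to 1,000", "1,001 to 10,000"]  # hypothetical labels

# Bins are right-inclusive by default: (-1, 0], (0, 10], (10, 1000], (1000, 10000]
category = pd.cut(installs, bins=bins, labels=labels)
print(pd.DataFrame({"installs": installs, "category": category}))
```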

&nbsp;

#### **3) Price column**

&nbsp;

- Checking for any null values in Price column

In [None]:
print('Total numbers of nulls values in Price column are: ',df['Price'].isnull().sum())

Total numbers of nulls values in Price column are:  0


In [None]:
print('\n') 
print('-'*70)  
print('Total number of unique values in Price column is : ', df['Price'].nunique())
print('-'*70)
print('\nUnique values in Price  column are')
print('-'*70)
print(df['Price'].unique())



----------------------------------------------------------------------
Total number of unique values in Price column is :  92
----------------------------------------------------------------------

Unique values in Price  column are
----------------------------------------------------------------------
['0' '$4.99' '$3.99' '$6.99' '$1.49' '$2.99' '$7.99' '$5.99' '$3.49'
 '$1.99' '$9.99' '$7.49' '$0.99' '$9.00' '$5.49' '$10.00' '$24.99'
 '$11.99' '$79.99' '$16.99' '$14.99' '$1.00' '$29.99' '$12.99' '$2.49'
 '$10.99' '$1.50' '$19.99' '$15.99' '$33.99' '$74.99' '$39.99' '$3.95'
 '$4.49' '$1.70' '$8.99' '$2.00' '$3.88' '$25.99' '$399.99' '$17.99'
 '$400.00' '$3.02' '$1.76' '$4.84' '$4.77' '$1.61' '$2.50' '$1.59' '$6.49'
 '$1.29' '$5.00' '$13.99' '$299.99' '$379.99' '$37.99' '$18.99' '$389.99'
 '$19.90' '$8.49' '$1.75' '$14.00' '$4.85' '$46.99' '$109.99' '$154.99'
 '$3.08' '$2.59' '$4.80' '$1.96' '$19.40' '$3.90' '$4.59' '$15.46' '$3.04'
 '$4.29' '$2.60' '$3.28' '$4.60' '$28.99' '$2.95' '

In [400]:
print('\n')
print('-'*70)
print('Total number of values that have \'$\':',df['Price'].loc[df['Price'].str.contains('$', regex=False)].value_counts().sum())
print('Total number of values that have \'0\':',df['Price'].loc[(df['Price'].str.contains('0')) & (~df['Price'].str.contains('$', regex=False))].value_counts().sum())

print('-'*70)



----------------------------------------------------------------------
Total number of values that have '$': 800
Total number of values that have '0': 10041
----------------------------------------------------------------------


&nbsp;

---
# **Observation**

---

- The `Price` column has `10841` values and no nulls: `10041` values are `0` with no `$` sign, and `800` values carry a `$` sign.
- The only problem here is the `$` sign. Let's remove it and convert the column to a numeric data type. The `0` values have no `$` sign and need no change.

---
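The same cleanup can be sketched with vectorized string methods. The `prices` series is a hypothetical sample; `regex=False` treats `$` as a literal character rather than the end-of-string anchor.

```python
import pandas as pd

# Hypothetical sample of raw prices
prices = pd.Series(["0", "$4.99", "$399.99", "$0.99"])

# Drop the literal '$' and parse the remainder as floats
numeric = pd.to_numeric(prices.str.replace("$", "", regex=False))

# A price above zero marks a paid app
is_paid = numeric > 0
print(numeric.tolist(), int(is_paid.sum()))
```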

&nbsp;

- Remove the `$` sign from the `Price` column.

In [410]:
df['Price']=df['Price'].apply(lambda x:x.replace('$','') if '$' in str(x) else x)

In [411]:
df['Price'].unique()
df['Price'].value_counts()

Price
0         10041
0.99        148
2.99        129
1.99         73
4.99         72
3.99         63
1.49         46
5.99         30
2.49         26
9.99         21
6.99         13
399.99       12
14.99        11
4.49          9
29.99         7
24.99         7
3.49          7
7.99          7
5.49          6
19.99         6
11.99         5
6.49          5
12.99         5
8.99          5
10.00         3
16.99         3
1.00          3
2.00          3
13.99         2
8.49          2
17.99         2
1.70          2
3.95          2
79.99         2
7.49          2
9.00          2
10.99         2
39.99         2
33.99         2
1.96          1
19.40         1
4.80          1
3.28          1
4.59          1
15.46         1
3.04          1
4.29          1
2.60          1
2.59          1
3.90          1
154.99        1
4.60          1
28.99         1
2.95          1
2.90          1
1.97          1
200.00        1
89.99         1
2.56          1
1.20          1
1.26          1
30.99         1
3.

In [412]:
df['Price']=df['Price'].apply(lambda x:float(x))

In [413]:
df.describe()

| | Rating | Reviews | Size_in_mb | Installs | Price |
|---|---|---|---|---|---|
| count | 9367.0 | 10841.0 | 9146.0 | 10841.0 | 10841.0 |
| mean | 4.191513 | 444111.9 | 21.514141 | 15462910.0 | 1.027273 |
| std | 0.515735 | 2927629.0 | 22.588679 | 85025570.0 | 15.948971 |
| min | 1.0 | 0.0 | 0.008301 | 0.0 | 0.0 |
| 25% | 4.0 | 38.0 | 4.9 | 1000.0 | 0.0 |
| 50% | 4.3 | 2094.0 | 13.0 | 100000.0 | 0.0 |
| 75% | 4.5 | 54768.0 | 30.0 | 5000000.0 | 0.0 |
| max | 5.0 | 78158310.0 | 100.0 | 1000000000.0 | 400.0 |

