# **Google Play Store Apps Analysis**

**Author Name:** Tayyab Riaz\
**Date:** 16-10-2023\
**Email:** m.tayyab.riaz@outlook.com\
**Data Reference:** [Kaggle](https://www.kaggle.com/datasets/lava18/google-play-store-apps/)

# **About Data Set**

**`Context`**

While many public datasets (on Kaggle and the like) provide Apple App Store data, few counterpart datasets are available for Google Play Store apps. On digging deeper, I found that the iTunes App Store page uses a nicely indexed, appendix-like structure that allows simple and easy web scraping. The Google Play Store, on the other hand, uses more sophisticated techniques (such as dynamic page loading) with jQuery, which makes scraping more challenging.

**`Content`**

Each app (row) has values for category, rating, size, and more.

**`Acknowledgements`**

This information is scraped from the Google Play Store. This app information would not be available without it.

**`Inspiration`**

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

**1) Importing Libraries**

In [14]:
import pandas as pd
import ydata_profiling as yd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [15]:
df=pd.read_csv('/home/tayyab/Documents/AI_Course/python_for_data_science/pandas/data_set/google_play_store/googleplaystore.csv')

In [16]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

**`1) App: `** Name of the app.\
**`2) Category: `** Category the app belongs to.\
**`3) Rating: `** Rating of the app, out of 5.\
**`4) Reviews: `** Total number of reviews given by users.\
**`5) Size: `** Size of the app, stored as text such as `19M` (megabytes).\
**`6) Installs: `** Number of installs, reported as a lower bound such as `10,000+`.\
**`7) Type: `** Whether the app is free or paid.\
**`8) Price: `** Price of the app.\
**`9) Content Rating: `** Intended audience, for example `Everyone` or `Teen`.\
**`10) Genres: `** Genre or genres of the app.\
**`11) Last Updated: `** Date the app was last updated.\
**`12) Current Ver: `** Current version of the app.\
**`13) Android Ver: `** Minimum Android version the app runs on.
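
Because `df.dtypes` reports `Reviews` and `Installs` as `object`, these columns hold strings rather than numbers. The following is a minimal sketch of how such columns might be converted, using a small hand-built frame with values copied from the first rows of the data. The helper name `to_number` is hypothetical, and the full dataset may contain values this sketch does not handle.

```python
import pandas as pd

# Small sample with values taken from the first rows of the dataset.
sample = pd.DataFrame({
    "Reviews": ["159", "967", "87510"],
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
})

def to_number(series):
    """Strip thousands separators and a trailing '+', then parse as numbers.

    Values that still cannot be parsed become NaN instead of raising.
    (Hypothetical helper, not part of the original notebook.)
    """
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    cleaned = cleaned.str.replace("+", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

sample["Reviews"] = to_number(sample["Reviews"])
sample["Installs"] = to_number(sample["Installs"])
print(sample.dtypes)
```

Using `errors="coerce"` keeps the conversion from stopping at the first malformed value; the resulting NaN entries can then be inspected separately.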



**2) ydata-Profiling Analysis**

In [6]:
profile=yd.ProfileReport(df)
profile.to_file(output_file='google_playstore.html')

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

KeyboardInterrupt: 
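
The profiling run above was interrupted before the report finished. As a lighter first look, plain pandas can give a rough per-column overview. This is only a sketch on a hand-built frame, not a substitute for the profile report, and the helper name `quick_overview` is hypothetical.

```python
import pandas as pd

def quick_overview(frame):
    """Return one row per column: dtype, missing count, and distinct count.

    (Hypothetical helper for illustration.)
    """
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),
    })

# Tiny stand-in for the real data; the third rating is deliberately missing.
sample = pd.DataFrame({
    "Category": ["ART_AND_DESIGN", "ART_AND_DESIGN", "ART_AND_DESIGN"],
    "Rating": [4.1, 3.9, None],
    "Size": ["19M", "14M", "8.7M"],
})
overview = quick_overview(sample)
print(overview)
```

On the real data the same call would be `quick_overview(df)`. If the full report is still wanted, `ProfileReport` also accepts `minimal=True`, which skips some of the more expensive computations.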

In [17]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [18]:
df.Size

0                       19M
1                       14M
2                      8.7M
3                       25M
4                      2.8M
                ...        
10836                   53M
10837                  3.6M
10838                  9.5M
10839    Varies with device
10840                   19M
Name: Size, Length: 10841, dtype: object

In [19]:
# nunique is a method and must be called; without the parentheses
# pandas only echoes the bound method together with the Series itself
df['Size'].nunique()

In [20]:
# def convert_to_mb(size):
#     size_str = str(size)
#     if 'M' in size_str:
#         return float(size_str.replace('M', ''))
#     elif 'K' in size_str:
#         return float(size_str.replace('K', ''))/1024
#     else:
#         return float(size_str)

# df['Size'] = df['Size'].apply(convert_to_mb)

def convert_to_mb(size):
    size_str = str(size)
    if 'M' in size_str:
        return float(size_str.replace('M', ''))
    elif 'K' in size_str:
        return float(size_str.replace('K', ''))/1024
    elif size == 'Varies with device':
        return None  # or instead of None, return a valid default value
    else:
        return float(size_str)

df['Size'] = df['Size'].apply(convert_to_mb)



ValueError: could not convert string to float: '201k'

In [21]:
def convert_to_mb(size):
    size_str = str(size)
    if 'M' in size_str:
        return float(size_str.replace('M', ''))
    elif 'K' in size_str or 'k' in size_str:
        return float(size_str.upper().replace('K', ''))/1024
    elif size == 'Varies with device':
        return None  # or instead of None, return a valid default value
    else:
        return float(size_str)

df['Size'] = df['Size'].apply(convert_to_mb)


ValueError: could not convert string to float: '1,000+'
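
Both errors come from values that do not fit the `<number>M` shape the function expects. One way to list such values before converting is to test each string against a regular expression. The sketch below uses a small sample that includes the two strings from the error messages; the pattern and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Sample mixing well-formed sizes with the two values from the error messages.
sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device", "1,000+"])

# Expected shape: digits, an optional decimal part, then M/m or K/k.
pattern = r"^\d+(?:\.\d+)?[MmKk]$"
matches = sizes.str.match(pattern)

# Everything that would need special handling in the converter.
unexpected = sizes[~matches]
print(unexpected.tolist())
```

On the real column, replacing `sizes` with `df['Size']` and calling `unexpected.value_counts()` would show how often each irregular value occurs.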

In [22]:
def convert_to_mb(size):
    try:
        size_str = str(size)
        if 'M' in size_str:
            return float(size_str.replace('M', ''))
        elif 'K' in size_str or 'k' in size_str:
            return float(size_str.upper().replace('K', ''))/1024
        elif size == 'Varies with device':
            return None
        else:
            return float(size_str)
    except ValueError:
        return None

df['Size'] = df['Size'].apply(convert_to_mb)
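
The `try`/`except` version works, but it silently turns every unparseable value into a missing one. A vectorized alternative makes the accepted pattern explicit: split each string into a number and a unit letter, and scale by unit. This is a sketch under the assumption that sizes only ever use `M` or `k`/`K` suffixes; it is not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device", "1,000+"])

# Split each value into a numeric part and a unit letter; rows that do not
# fit the pattern get NaN in both columns.
parts = sizes.str.extract(r"^(?P<value>\d+(?:\.\d+)?)(?P<unit>[MmKk])$")
value = pd.to_numeric(parts["value"], errors="coerce")

# Megabytes stay as they are; kilobytes are divided by 1024.
# Rows without a unit keep factor 1.0, but their value is already NaN.
factor = np.where(parts["unit"].str.upper() == "K", 1 / 1024, 1.0)
size_mb = value * factor
print(size_mb)
```

Anything outside the pattern, including `Varies with device` and `1,000+`, ends up as NaN, just as in the `apply` version.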


In [24]:
df['Size'].unique()

array([1.90000000e+01, 1.40000000e+01, 8.70000000e+00, 2.50000000e+01,
       2.80000000e+00, 5.60000000e+00, 2.90000000e+01, 3.30000000e+01,
       3.10000000e+00, 2.80000000e+01, 1.20000000e+01, 2.00000000e+01,
       2.10000000e+01, 3.70000000e+01, 2.70000000e+00, 5.50000000e+00,
       1.70000000e+01, 3.90000000e+01, 3.10000000e+01, 4.20000000e+00,
       7.00000000e+00, 2.30000000e+01, 6.00000000e+00, 6.10000000e+00,
       4.60000000e+00, 9.20000000e+00, 5.20000000e+00, 1.10000000e+01,
       2.40000000e+01,            nan, 9.40000000e+00, 1.50000000e+01,
       1.00000000e+01, 1.20000000e+00, 2.60000000e+01, 8.00000000e+00,
       7.90000000e+00, 5.60000000e+01, 5.70000000e+01, 3.50000000e+01,
       5.40000000e+01, 1.96289062e-01, 3.60000000e+00, 5.70000000e+00,
       8.60000000e+00, 2.40000000e+00, 2.70000000e+01, 2.50000000e+00,
       1.60000000e+01, 3.40000000e+00, 8.90000000e+00, 3.90000000e+00,
       2.90000000e+00, 3.80000000e+01, 3.20000000e+01, 5.40000000e+00,
      

In [25]:
def convert_to_mb(size):
    try:
        size_str = str(size)
        if 'M' in size_str:
            return round(float(size_str.replace('M', '')), 2)
        elif 'K' in size_str or 'k' in size_str:
            return round(float(size_str.upper().replace('K', ''))/1024, 2)
        elif size == 'Varies with device':
            return None  # or a suitable value based on your dataset
        else:
            return round(float(size_str), 2)
    except ValueError:
        return None

df['Size'] = df['Size'].apply(convert_to_mb)


In [28]:
df['Size'].nunique()

273

In [29]:
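# Fill missing values with the median.
# Assigning the result back avoids calling fillna(inplace=True) on a
# selected column, which recent pandas versions warn about.
median_size = df['Size'].median()
df['Size'] = df['Size'].fillna(median_size)

# or fill with the mean
# mean_size = df['Size'].mean()
# df['Size'] = df['Size'].fillna(mean_size)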
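
The choice between median and mean matters when a few very large apps skew the distribution. Here is a small illustration with made-up sizes; the numbers are invented for the example and are not taken from the dataset.

```python
import pandas as pd

# Invented sizes in MB: mostly small apps, one very large one, one missing.
sizes = pd.Series([2.0, 3.0, 4.0, 5.0, 100.0, None])

# fillna returns a new Series, so the original stays untouched.
median_filled = sizes.fillna(sizes.median())
mean_filled = sizes.fillna(sizes.mean())

print("filled with median:", median_filled.iloc[-1])
print("filled with mean:  ", mean_filled.iloc[-1])
```

The single 100 MB entry pulls the mean well above the typical app, while the median stays near the bulk of the values, which is one reason to prefer the median for a skewed column like `Size`.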


In [31]:
df['Size'].unique()

array([1.9e+01, 1.4e+01, 8.7e+00, 2.5e+01, 2.8e+00, 5.6e+00, 2.9e+01,
       3.3e+01, 3.1e+00, 2.8e+01, 1.2e+01, 2.0e+01, 2.1e+01, 3.7e+01,
       2.7e+00, 5.5e+00, 1.7e+01, 3.9e+01, 3.1e+01, 4.2e+00, 7.0e+00,
       2.3e+01, 6.0e+00, 6.1e+00, 4.6e+00, 9.2e+00, 5.2e+00, 1.1e+01,
       2.4e+01, 1.3e+01, 9.4e+00, 1.5e+01, 1.0e+01, 1.2e+00, 2.6e+01,
       8.0e+00, 7.9e+00, 5.6e+01, 5.7e+01, 3.5e+01, 5.4e+01, 2.0e-01,
       3.6e+00, 5.7e+00, 8.6e+00, 2.4e+00, 2.7e+01, 2.5e+00, 1.6e+01,
       3.4e+00, 8.9e+00, 3.9e+00, 2.9e+00, 3.8e+01, 3.2e+01, 5.4e+00,
       1.8e+01, 1.1e+00, 2.2e+00, 4.5e+00, 9.8e+00, 5.2e+01, 9.0e+00,
       6.7e+00, 3.0e+01, 2.6e+00, 7.1e+00, 3.7e+00, 2.2e+01, 7.4e+00,
       6.4e+00, 3.2e+00, 8.2e+00, 9.9e+00, 4.9e+00, 9.5e+00, 5.0e+00,
       5.9e+00, 7.3e+01, 6.8e+00, 3.5e+00, 4.0e+00, 2.3e+00, 7.2e+00,
       2.1e+00, 4.2e+01, 7.3e+00, 9.1e+00, 5.5e+01, 2.0e-02, 6.5e+00,
       1.5e+00, 7.5e+00, 5.1e+01, 4.1e+01, 4.8e+01, 8.5e+00, 4.6e+01,
       8.3e+00, 4.3e