## 0. Setting up the notebook

#### 0.1 - Installing the necessary libraries

In [3]:
%pip install python-dotenv
%pip install opendatasets
%pip install kaggle
%pip install pandas matplotlib seaborn


Collecting python-dotenv
  Obtaining dependency information for python-dotenv from https://files.pythonhosted.org/packages/6a/3e/b68c118422ec867fa7ab88444e1274aa40681c606d59ac27de5a5588f082/python_dotenv-1.0.1-py3-none-any.whl.metadata
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1

[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Collecting tqdm (from opendatasets)
  Obtaining dependency information for tqdm from https://files.pythonhosted.org/packages/00/e5/f12a80907d0884e6dff9

#### 0.2 - Importing the necessary libraries

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#### 0.3 - Downloading the dataset via opendatasets

In [1]:
import opendatasets as od 
dataset_url = 'https://www.kaggle.com/datasets/gauthamp10/google-playstore-apps'
od.download(dataset_url)

Downloading google-playstore-apps.zip to ./google-playstore-apps


100%|██████████| 207M/207M [02:01<00:00, 1.78MB/s] 
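Re-running the notebook should not need the network when the CSV is already on disk. Below is a minimal sketch of such a guard. The helper name `download_if_missing` is hypothetical (it is not part of opendatasets), and the actual download is passed in as a callable so the helper makes no assumptions about how the file is fetched.

```python
from pathlib import Path


def download_if_missing(csv_path, download):
    """Run `download()` only if `csv_path` is absent; return True when it ran."""
    if Path(csv_path).exists():
        return False  # file already on disk, nothing to do
    download()
    return True
```

In this notebook the call could look like `download_if_missing('google-playstore-apps/Google-Playstore.csv', lambda: od.download(dataset_url))`.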





#### 0.4 - Loading the dataset to a DataFrame

In [6]:
df = pd.read_csv('google-playstore-apps/Google-Playstore.csv')
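Loading every column of a file this size is not always necessary. The sketch below shows how `pd.read_csv` can read only selected columns and assign a dtype at load time. It runs on a tiny in-memory stand-in rather than the real file; the rows are made up, and the column names are taken from the `df.columns` output in section 1.2.

```python
import io

import pandas as pd

# Tiny stand-in for Google-Playstore.csv (made-up rows, real column names).
sample_csv = io.StringIO(
    "App Name,Category,Rating,Free,Scraped Time\n"
    "Demo A,Tools,4.4,True,2021-06-15 20:19:35\n"
    "Demo B,Adventure,0.0,True,2021-06-15 20:19:35\n"
)

subset = pd.read_csv(
    sample_csv,
    usecols=["App Name", "Category", "Rating"],  # skip columns we do not need
    dtype={"Category": "category"},              # store repeated labels compactly
)
print(subset.dtypes)
```

Whether this is worthwhile depends on the analysis: the notebook itself loads all 24 columns.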

## 1. Data Understanding

#### 1.1 - View the first 5 rows of the DataFrame

In [7]:
df.head(5)

| | App Name | App Id | Category | Rating | Rating Count | Installs | Minimum Installs | Maximum Installs | Free | Price | ... | Developer Website | Developer Email | Released | Last Updated | Content Rating | Privacy Policy | Ad Supported | In App Purchases | Editors Choice | Scraped Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Gakondo | com.ishakwe.gakondo | Adventure | 0.0 | 0.0 | 10+ | 10.0 | 15 | True | 0.0 | ... | https://beniyizibyose.tk/#/ | jean21101999@gmail.com | Feb 26, 2020 | Feb 26, 2020 | Everyone | https://beniyizibyose.tk/projects/ | False | False | False | 2021-06-15 20:19:35 |
| 1 | Ampere Battery Info | com.webserveis.batteryinfo | Tools | 4.4 | 64.0 | 5,000+ | 5000.0 | 7662 | True | 0.0 | ... | https://webserveis.netlify.app/ | webserveis@gmail.com | May 21, 2020 | May 06, 2021 | Everyone | https://dev4phones.wordpress.com/licencia-de-uso/ | True | False | False | 2021-06-15 20:19:35 |
| 2 | Vibook | com.doantiepvien.crm | Productivity | 0.0 | 0.0 | 50+ | 50.0 | 58 | True | 0.0 | ... | | vnacrewit@gmail.com | Aug 9, 2019 | Aug 19, 2019 | Everyone | https://www.vietnamairlines.com/vn/en/terms-an... | False | False | False | 2021-06-15 20:19:35 |
| 3 | Smart City Trichy Public Service Vehicles 17UC... | cst.stJoseph.ug17ucs548 | Communication | 5.0 | 5.0 | 10+ | 10.0 | 19 | True | 0.0 | ... | http://www.climatesmarttech.com/ | climatesmarttech2@gmail.com | Sep 10, 2018 | Oct 13, 2018 | Everyone | | True | False | False | 2021-06-15 20:19:35 |
| 4 | GROW.me | com.horodyski.grower | Tools | 0.0 | 0.0 | 100+ | 100.0 | 478 | True | 0.0 | ... | http://www.horodyski.com.pl | rmilekhorodyski@gmail.com | Feb 21, 2020 | Nov 12, 2018 | Everyone | http://www.horodyski.com.pl | False | False | False | 2021-06-15 20:19:35 |
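The `...` column in the output above is pandas eliding the middle columns of a wide frame. A minimal sketch of lifting that limit temporarily with `pd.option_context` follows; the frame `wide` is a hypothetical stand-in so the snippet runs without the dataset.

```python
import pandas as pd

# Hypothetical wide frame standing in for the 24-column dataset.
wide = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(30)})

# Temporarily lift the column limit; the setting reverts when the block exits.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    full_view = repr(wide.head())

print(full_view)
```

In the notebook, the same `with` block around `display(df.head(5))` should show all 24 columns without changing the global display settings.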


#### 1.2 - Let's see which columns make up the DataFrame

In [8]:
df.columns

Index(['App Name', 'App Id', 'Category', 'Rating', 'Rating Count', 'Installs',
       'Minimum Installs', 'Maximum Installs', 'Free', 'Price', 'Currency',
       'Size', 'Minimum Android', 'Developer Id', 'Developer Website',
       'Developer Email', 'Released', 'Last Updated', 'Content Rating',
       'Privacy Policy', 'Ad Supported', 'In App Purchases', 'Editors Choice',
       'Scraped Time'],
      dtype='object')

#### 1.3 - Let's look at the structure of the DataFrame (number of rows and columns)

In [11]:
df.shape

(2312944, 24)

> The DataFrame has 2,312,944 rows and 24 columns.

#### 1.4 - Let's get a concise summary of the DataFrame

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2312944 entries, 0 to 2312943
Data columns (total 24 columns):
 #   Column             Dtype  
---  ------             -----  
 0   App Name           object 
 1   App Id             object 
 2   Category           object 
 3   Rating             float64
 4   Rating Count       float64
 5   Installs           object 
 6   Minimum Installs   float64
 7   Maximum Installs   int64  
 8   Free               bool   
 9   Price              float64
 10  Currency           object 
 11  Size               object 
 12  Minimum Android    object 
 13  Developer Id       object 
 14  Developer Website  object 
 15  Developer Email    object 
 16  Released           object 
 17  Last Updated       object 
 18  Content Rating     object 
 19  Privacy Policy     object 
 20  Ad Supported       bool   
 21  In App Purchases   bool   
 22  Editors Choice     bool   
 23  Scraped Time       object 
dtypes: bool(4), float64(4), int64(1), object(15)
memor

The output of `df.info()` summarizes the structure of the dataset:

- **Total Entries**: The dataset has 2,312,944 entries, indexed from 0 to 2,312,943.

- **Total Columns**: The dataset has 24 columns, each describing one attribute of an app listed on the Google Play Store.

**Data Types Overview**:

- **Object** (15 columns): These columns hold text or categorical values: '__App Name__', '__App Id__', '__Category__', '__Installs__', '__Currency__', '__Size__', '__Minimum Android__', '__Developer Id__', '__Developer Website__', '__Developer Email__', '__Released__', '__Last Updated__', '__Content Rating__', '__Privacy Policy__', '__Scraped Time__'. Some of them are not free text: '__Installs__' stores counts as strings such as `5,000+`, and '__Released__', '__Last Updated__', and '__Scraped Time__' store dates as strings.

- **Float64** (4 columns): '__Rating__', '__Rating Count__', '__Minimum Installs__', and '__Price__' hold floating-point numbers. '__Rating Count__' and '__Minimum Installs__' are counts; pandas typically stores a count column as float when it contains missing values (NaN), so these two are worth checking.

- **Int64** (1 column): '__Maximum Installs__' is the only integer column; it records the maximum number of times an app has been installed.

- **Bool** (4 columns): Boolean columns like '__Free__', '__Ad Supported__', '__In App Purchases__', and '__Editors Choice__' provide binary (True/False) data, reflecting whether an app is free, has ads, offers in-app purchases, or is an editor's choice, respectively.
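The per-dtype tallies above can also be computed rather than counted by hand. The sketch below does this on a small stand-in frame; the app names are made up, and only the column names and dtypes mirror the real dataset.

```python
import pandas as pd

# Stand-in frame: made-up app names, dtypes chosen to mirror the real columns.
sample = pd.DataFrame({
    "App Name": pd.Series(["Demo A", "Demo B"], dtype="object"),
    "Rating": pd.Series([4.4, 0.0], dtype="float64"),
    "Maximum Installs": pd.Series([7662, 15], dtype="int64"),
    "Free": pd.Series([True, True], dtype="bool"),
})

# How many columns of each dtype?
dtype_counts = sample.dtypes.astype(str).value_counts()
print(dtype_counts)

# Names of the columns stored as generic Python objects (usually strings).
object_cols = sample.select_dtypes(include="object").columns.tolist()
print(object_cols)
```

Run against `df`, the first call should reproduce the `dtypes:` line of `df.info()`.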

**Other info**:

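- **Non-Null Counts**: The output above has no `Non-Null Count` column, because `df.info()` omits it by default once a DataFrame exceeds pandas's `display.max_info_rows` threshold. Calling `df.info(show_counts=True)` restores the counts, which reveal missing values and guide data cleaning or imputation.

- **Memory Usage**: The reported figure is approximately 361.8 MB. For `object` columns this counts only the references, not the strings they point to, so the real footprint is larger; `df.info(memory_usage='deep')` reports the full size. Memory usage can influence processing and analysis strategies, especially on machines with limited RAM.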
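Both points can be checked directly. The sketch below uses a small made-up frame with a few gaps: it forces the `Non-Null Count` column with `show_counts=True` (pandas hides it by default on frames above `display.max_info_rows`), counts missing values per column, and compares shallow and deep memory estimates.

```python
import io

import numpy as np
import pandas as pd

# Made-up rows with a few gaps; column names mirror the real dataset.
sample = pd.DataFrame({
    "App Name": ["Demo A", "Demo B", None],
    "Rating": [4.4, np.nan, 0.0],
    "Free": [True, True, False],
})

# Force the Non-Null Count column and capture the report as text.
buffer = io.StringIO()
sample.info(show_counts=True, buf=buffer)
print(buffer.getvalue())

# Missing values per column, largest first.
missing = sample.isna().sum().sort_values(ascending=False)
print(missing)

# Shallow estimate versus a deep one that also measures string contents.
shallow_bytes = int(sample.memory_usage().sum())
deep_bytes = int(sample.memory_usage(deep=True).sum())
print(shallow_bytes, deep_bytes)
```

On the full dataset the deep estimate takes noticeably longer, since pandas has to inspect every string.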

This summary shows the dataset's basic structure and points to the next steps: converting data types, handling missing values, and reducing memory usage for more efficient computation.
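As an illustration of the type conversions mentioned here, the sketch below converts three `object` columns, using values copied from the `df.head(5)` output in section 1.1. It is one possible approach, not necessarily the one the notebook adopts later. Parsing '__Installs__' is shown only for illustration, since '__Minimum Installs__' already holds the same number, and the date format string is an assumption based on the five visible rows.

```python
import pandas as pd

# Three values per column, copied from the df.head(5) output.
sample = pd.DataFrame({
    "Category": ["Adventure", "Tools", "Productivity"],
    "Installs": ["10+", "5,000+", "50+"],
    "Released": ["Feb 26, 2020", "May 21, 2020", "Aug 9, 2019"],
})

converted = sample.assign(
    # Repeated labels are stored once, with small integer codes per row.
    Category=sample["Category"].astype("category"),
    # Strip thousands separators and the trailing "+" before casting.
    Installs=sample["Installs"].str.replace(r"[,+]", "", regex=True).astype("int64"),
    # errors="coerce" turns unparseable dates into NaT instead of raising.
    Released=pd.to_datetime(sample["Released"], format="%b %d, %Y", errors="coerce"),
)
print(converted.dtypes)
```

With `errors="coerce"`, any date that does not match the assumed format becomes `NaT`, so counting `NaT` values after conversion is a quick way to test that assumption on the full column.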