Dataset extraction

Extracted the compressed Google Play Store dataset to access the raw CSV file.

In [3]:
import zipfile

with zipfile.ZipFile('googleplaystore.csv.zip', 'r') as zip_ref:
    zip_ref.extractall()
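
Before extracting, it can help to list the archive members and run the built-in integrity check. The following is a minimal sketch that builds a tiny in-memory archive as a stand-in for the real download, so the member content here is illustrative only.

```python
import io
import zipfile

# Build a tiny in-memory archive so the sketch runs without the real download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("googleplaystore.csv", "App,Category\n")

with zipfile.ZipFile(buffer, "r") as zf:
    names = zf.namelist()      # members stored in the archive
    first_bad = zf.testzip()   # None when every member passes its CRC check

print(names, first_bad)
```

With the real file, the same two calls can be made on `zip_ref` before `extractall()`.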



Directory verification

Listed the working directory to confirm that the CSV file was extracted.

In [4]:
import os
os.listdir()




['.config', 'googleplaystore.csv', 'googleplaystore.csv.zip', 'sample_data']

Dataset loading

Loaded the raw Google Play Store dataset into a pandas DataFrame for preprocessing.

In [5]:
import pandas as pd

df = pd.read_csv('googleplaystore.csv')
df.head()


,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


Column name standardization


Standardized column names by converting to lowercase and replacing spaces with underscores for consistency.

In [6]:
df.columns = df.columns.str.lower().str.replace(" ", "_")
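
To illustrate the effect of this step, here is a small sketch on an empty toy frame that reuses a few of the original column names. The frame `demo` is an assumption made for illustration. Once the names are standardized, any later code has to refer to columns by their new names, for example `content_rating` rather than `Content Rating`.

```python
import pandas as pd

# Toy frame carrying a few of the original column names (no data needed).
demo = pd.DataFrame(columns=["App", "Content Rating", "Last Updated", "Android Ver"])

# Same transformation as above: lowercase, then spaces -> underscores.
demo.columns = demo.columns.str.lower().str.replace(" ", "_")

print(list(demo.columns))
# ['app', 'content_rating', 'last_updated', 'android_ver']
```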


Dataset dimensions

Verified the number of rows and columns after initial data loading and column standardization.

In [None]:
df.shape


(10841, 13)

Dataset structure overview

Inspected column data types and non-null counts to identify fields requiring preprocessing.

In [None]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app             10841 non-null  object 
 1   category        10841 non-null  object 
 2   rating          9367 non-null   float64
 3   reviews         10841 non-null  object 
 4   size            10841 non-null  object 
 5   installs        10841 non-null  object 
 6   type            10840 non-null  object 
 7   price           10841 non-null  object 
 8   content_rating  10840 non-null  object 
 9   genres          10841 non-null  object 
 10  last_updated    10841 non-null  object 
 11  current_ver     10833 non-null  object 
 12  android_ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
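
The non-null counts reported by `df.info()` can also be read as missing-value counts per column. A minimal sketch on a toy frame (the data below is illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Illustrative frame with a couple of gaps.
demo = pd.DataFrame({
    "rating": [4.1, np.nan, 4.7],
    "type": ["Free", "Free", None],
    "installs": ["10,000+", "500,000+", "5,000,000+"],
})

# Count missing entries per column and keep only the columns that have any.
missing = demo.isna().sum()
print(missing[missing > 0])
```

Applied to the full DataFrame, this would single out the columns whose non-null count is below the row count.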


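Installs column preprocessing

Removed the trailing "+" and the thousands separators, then converted install counts to numeric values to enable quantitative analysis. Entries that cannot be parsed become NaN.

In [None]:
# Column names are lowercase after the standardization step above.
df['installs'] = df['installs'].astype(str)
df['installs'] = df['installs'].str.replace('+', '', regex=False)  # drop trailing '+'
df['installs'] = df['installs'].str.replace(',', '', regex=False)  # drop thousands separators
df['installs'] = pd.to_numeric(df['installs'], errors='coerce')    # unparseable values -> NaN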
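
As a worked example of this conversion, the sketch below applies the same steps to a few sample strings in the format shown by `df.head()`. The value `'Free'` is a hypothetical malformed entry, added to show what `errors='coerce'` does.

```python
import pandas as pd

# Sample install strings; 'Free' is a hypothetical misplaced value.
raw = pd.Series(["10,000+", "500,000+", "5,000,000+", "Free"])

cleaned = (
    raw.astype(str)
       .str.replace("+", "", regex=False)   # literal '+', not a regex quantifier
       .str.replace(",", "", regex=False)   # thousands separators
)
installs = pd.to_numeric(cleaned, errors="coerce")

print(installs.tolist())
# [10000.0, 500000.0, 5000000.0, nan]
```

Because one entry becomes NaN, the resulting column is float64 rather than an integer type.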

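Price column preprocessing

Removed currency symbols and converted price values to numeric format for comparison between free and paid apps.

In [None]:
# Column names are lowercase after the standardization step above.
df['price'] = df['price'].astype(str)
# regex=False treats '$' as a literal character, not the end-of-string anchor.
df['price'] = df['price'].str.replace('$', '', regex=False)
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # unparseable values -> NaN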
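
One detail worth illustrating: `$` is the end-of-string anchor in regular expressions, so whether `str.replace` removes the currency symbol depends on the `regex` argument. The sketch below uses made-up price strings to show the difference.

```python
import pandas as pd

# Illustrative price strings.
prices = pd.Series(["0", "$4.99", "$0.99"])

# As a regex, '$' matches the empty position at the end of each string.
as_regex = prices.str.replace("$", "", regex=True)
# As a literal, '$' matches the currency symbol itself.
as_literal = prices.str.replace("$", "", regex=False)

print(as_regex.tolist())    # ['0', '$4.99', '$0.99']
print(as_literal.tolist())  # ['0', '4.99', '0.99']
print(pd.to_numeric(as_literal, errors="coerce").tolist())  # [0.0, 4.99, 0.99]
```

Passing `regex=False` explicitly keeps the behavior the same across pandas versions, since the default for this argument has changed over time.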

Price data validation

Checked the price column for missing values and confirmed its numeric data type after preprocessing.

In [None]:
df['price'].isnull().sum()  # not echoed: a cell displays only its last expression
df['price'].dtype


dtype('float64')
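
A validation step like this can also be written so that both results are reported and basic expectations are asserted. This is a sketch on illustrative values, where `'Everyone'` stands in for a misplaced, non-numeric entry.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Illustrative values; 'Everyone' stands in for a misplaced, non-numeric entry.
price = pd.to_numeric(pd.Series(["0", "4.99", "Everyone"]), errors="coerce")

n_missing = int(price.isna().sum())
print(f"missing: {n_missing}, dtype: {price.dtype}")
# missing: 1, dtype: float64

# Fail loudly if the column is not numeric or contains negative prices.
assert is_numeric_dtype(price)
assert (price.dropna() >= 0).all()
```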

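Reviews column preprocessing

Converted review counts to numeric format to support statistical and correlation analysis.

In [None]:
# Column names are lowercase after the standardization step above.
df['reviews'] = df['reviews'].astype(str)
# Drops a literal 'M' suffix without scaling the value by one million.
df['reviews'] = df['reviews'].str.replace('M', '', regex=False)
df['reviews'] = pd.to_numeric(df['reviews'], errors='coerce')

In [None]:
df['reviews'].head()  # not echoed: a cell displays only its last expression
df['reviews'].dtype


dtype('float64')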
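
Removing an "M" suffix keeps only the digits, so a string such as "3.0M" would parse as 3.0 rather than three million. If the suffix is meant as a unit, one alternative is to scale the value instead. The sketch below is an assumption about how that could look, not the approach used in this notebook; `parse_count` is a hypothetical helper and the sample values are illustrative.

```python
import pandas as pd

def parse_count(value):
    """Hypothetical helper: parse '159' or '3.0M' into a float, NaN if unparseable."""
    text = str(value).strip()
    multiplier = 1.0
    if text.endswith("M"):
        multiplier = 1_000_000.0
        text = text[:-1]
    try:
        return float(text) * multiplier
    except ValueError:
        return float("nan")

sample = pd.Series(["159", "87510", "3.0M", "n/a"])
parsed = sample.map(parse_count)

print(parsed.tolist())
# [159.0, 87510.0, 3000000.0, nan]
```

Whether scaling or coercing to NaN is the better choice depends on whether such values are genuine counts or symptoms of a malformed row.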

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   app             10841 non-null  object 
 1   category        10841 non-null  object 
 2   rating          9367 non-null   float64
 3   reviews         10841 non-null  float64
 4   size            10841 non-null  object 
 5   installs        10840 non-null  float64
 6   type            10840 non-null  object 
 7   price           10840 non-null  float64
 8   content_rating  10840 non-null  object 
 9   genres          10841 non-null  object 
 10  last_updated    10841 non-null  object 
 11  current_ver     10833 non-null  object 
 12  android_ver     10838 non-null  object 
dtypes: float64(4), object(9)
memory usage: 1.1+ MB


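Data cleaning completed

- Converted Installs, Price, and Reviews from text to numeric (float64) format.
- Removed symbols ("+", ",", "$") and the "M" suffix to enable numerical analysis.
- Coerced values that could not be parsed to NaN, which leaves one missing value each in Installs and Price.
- These three columns are now ready for analysis; Size, Last Updated, and the version columns remain text.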

The cleaned dataset has been saved to data/processed/cleaned_apps.csv and will be used for exploratory data analysis in a separate notebook.


In [13]:
import os
os.makedirs("../data/processed", exist_ok=True)



In [14]:
df.to_csv("../data/processed/cleaned_apps.csv", index=False)
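
A quick round-trip check can confirm that the saved file reloads with the expected columns and numeric types. The sketch below writes a toy frame to a temporary directory rather than to the project path, so the data and location are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned DataFrame.
demo = pd.DataFrame({
    "app": ["A", "B"],
    "installs": [10000.0, 500000.0],
    "price": [0.0, 4.99],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "data", "processed")
    os.makedirs(out_dir, exist_ok=True)          # create nested folders if missing
    path = os.path.join(out_dir, "cleaned_apps.csv")
    demo.to_csv(path, index=False)               # index=False avoids an extra index column
    reloaded = pd.read_csv(path)

print(reloaded.dtypes)
```

Writing with `index=False` matters here: without it, reloading the file would produce an additional unnamed index column.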


