Hands-on Implementation - DE with Python and Pandas - Google Play Store App Analysis Pipeline #45

akash-coded · 2024-08-26T11:24:42Z

akash-coded
Aug 26, 2024
Maintainer

This is a hands-on implementation scenario based on the Google Play Store dataset, focusing on practical data engineering tasks. This project will guide you through data ingestion, transformation, and analysis using pandas.

Project: Google Play Store App Analysis Pipeline

Scenario:
You're a data engineer at a mobile app development startup. Your company wants to understand the Google Play Store market to make informed decisions about their next app. You've been tasked with building a data pipeline to ingest, process, and analyze the Google Play Store dataset.

Dataset: https://github.com/schlende/practical-pandas-projects/blob/master/datasets/google-play-store-11-2018.csv

Play Store Apps dataset - Why is this dataset interesting?

The Google Play Store is Google's app marketplace. Most people access the Google Play Store when they want to install new apps onto their Android phones.

Like any market, apps in the Play Store are subject to supply and demand. That is to say, certain kinds of apps get downloaded a lot while others don't. Certain kinds of apps get paid for while others don't. Some categories of apps have a great deal of competition while others don't.

A dataset like this can help you spot opportunities.

Ideas for questions this data can help you answer:

What categories of applications get a lot of downloads per day?
What categories of applications don't get many downloads per day?
In what app categories are there market leaders (one app that clearly is getting downloaded more than the others)?
How many downloads per day might you expect if you took the time to build an app?
What can the data tell you about monetization approaches?

Creating the Pipeline

Part 1: Data Ingestion and Initial Exploration

Load the dataset and perform initial exploration:

import pandas as pd

# Load the dataset
df = pd.read_csv('______')  # Fill in the URL

# Display basic information about the dataset
print(df._____)  # Display info about DataFrame
print(df._____(5))  # Display first 5 rows
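The dataset link above points at a GitHub page, while pd.read_csv needs the raw file behind it. The sketch below shows one way to derive the raw address from the page address. The helper name github_blob_to_raw is hypothetical and not part of pandas or any GitHub library.

```python
def github_blob_to_raw(url: str) -> str:
    """Convert a github.com 'blob' page URL into its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not url.startswith(prefix) or "/blob/" not in url:
        return url  # leave anything that is not a blob page URL untouched
    # Drop the host and remove the first '/blob/' path segment
    path = url[len(prefix):].replace("/blob/", "/", 1)
    return "https://raw.githubusercontent.com/" + path


blob_url = (
    "https://github.com/schlende/practical-pandas-projects/"
    "blob/master/datasets/google-play-store-11-2018.csv"
)
raw_url = github_blob_to_raw(blob_url)
print(raw_url)
```

The resulting address can then be passed to pd.read_csv in place of the blank.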

Handle missing values:

# Check for missing values
print(df.______.sum())

# Fill missing values in numeric columns with median
numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns]._____)

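# Fill missing values in categorical columns with mode
# Note: DataFrame.mode() returns a DataFrame (one row per tied mode), so select its first row before passing it to fillna
categorical_columns = df.select_dtypes(include=['object']).columns
df[categorical_columns] = df[categorical_columns].fillna(df[categorical_columns]._____)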
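To see how median and mode filling behave before running them on the real data, here is a small sketch on a made-up DataFrame. The toy values and the explicit column lists are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (values are made up)
toy = pd.DataFrame({
    "Rating": [4.0, np.nan, 3.0, 5.0],
    "Category": ["GAME", "GAME", None, "TOOLS"],
})

numeric_cols = ["Rating"]
categorical_cols = ["Category"]

# median() returns one value per column, which fillna aligns by column name
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# mode() returns a DataFrame, so take its first row to get one value per column
toy[categorical_cols] = toy[categorical_cols].fillna(toy[categorical_cols].mode().iloc[0])

print(toy)
```

Passing the mode() result without .iloc[0] would align on the row index instead, so only the first row could ever be filled.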

Part 2: Data Transformation

Convert 'Installs' column to numeric:

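# regex=False treats '+' as a literal character; as a regular expression, a bare '+' is invalid
df['Installs'] = df['Installs'].str.replace('+', '', regex=False).str.replace(',', '', regex=False).astype(___)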

Create a 'Price_USD' column:

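# regex=False treats '$' as a literal character; as a regular expression, '$' matches the end of the string
df['Price_USD'] = df['Price'].str.replace('$', '', regex=False).astype(___)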
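The two string clean-ups can be tried on a few sample values first. This is a minimal sketch; the sample strings are assumptions about how the columns are formatted, not values taken from the dataset.

```python
import pandas as pd

# Made-up sample values in the assumed "10,000+" and "$4.99" formats
installs_raw = pd.Series(["10,000+", "500+", "1,000,000+"])
price_raw = pd.Series(["0", "$4.99", "$0.99"])

# Strip the literal '+' and ',' characters, then convert to integers
installs = (
    installs_raw.str.replace("+", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype("int64")
)

# Strip the literal '$' character, then convert to floats
price_usd = price_raw.str.replace("$", "", regex=False).astype("float64")

print(installs.tolist())
print(price_usd.tolist())
```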

Convert 'Last Updated' to datetime:

df['Last_Updated'] = pd.______(df['Last Updated'])

Create a 'Days_Since_Update' column:

current_date = pd.Timestamp.now()
df['Days_Since_Update'] = (current_date - df['Last_Updated']).dt._____
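Because pd.Timestamp.now() changes on every run, the day counts are not reproducible. One option is to fix a reference date, as in the sketch below. The date 2018-11-30 is only a guess based on the dataset's file name, and the sample dates are made up.

```python
import pandas as pd

# Made-up sample values, including one that cannot be parsed
raw_dates = pd.Series(["2018-01-07", "not a date", "2018-08-01"])

# errors="coerce" turns unparseable entries into NaT instead of raising
last_updated = pd.to_datetime(raw_dates, errors="coerce")

# Assumed snapshot date (a guess from the file name, not confirmed by the source)
reference_date = pd.Timestamp("2018-11-30")

# Subtracting gives timedeltas; .dt.days extracts whole days (NaN where the date was NaT)
days_since_update = (reference_date - last_updated).dt.days
print(days_since_update)
```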

Part 3: Analysis and Business Logic Application

Calculate downloads per day:

df['Downloads_Per_Day'] = df['Installs'] / df['Days_Since_Update']

Categorize apps based on downloads per day:

def categorize_popularity(downloads):
    if downloads < 10:
        return 'Low'
    elif downloads < 100:
        return 'Medium'
    else:
        return 'High'

df['Popularity'] = df['Downloads_Per_Day'].apply(_______)
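As an alternative to applying a Python function row by row, the same thresholds can be expressed with pd.cut. This is a sketch of a different technique, not the exercise's intended answer, and the sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up downloads-per-day values around the two thresholds
downloads_per_day = pd.Series([0.5, 10, 99.9, 100, 2500])

# right=False makes the bins [-inf, 10), [10, 100), [100, inf),
# matching the "< 10" and "< 100" comparisons in categorize_popularity
popularity = pd.cut(
    downloads_per_day,
    bins=[-np.inf, 10, 100, np.inf],
    labels=["Low", "Medium", "High"],
    right=False,
)
print(popularity.tolist())
```

One behavioral difference: pd.cut leaves missing values unlabeled, whereas categorize_popularity returns 'High' for NaN because both comparisons evaluate to False.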

Identify market leaders in each category:

market_leaders = df.groupby('Category').apply(lambda x: x.nlargest(1, 'Installs'))

Calculate average revenue per download:

df['Revenue_Per_Download'] = df['Price_USD']  # Each paid download earns the list price (store fees ignored)
df.loc[df['Type'] == 'Free', 'Revenue_Per_Download'] = 0.1  # Assume $0.1 per download for free apps

Part 4: Answering Business Questions

Top 5 categories by average downloads per day:

top_categories = df.groupby('Category')['Downloads_Per_Day']._____.sort_values(ascending=False).head()
print(top_categories)

Categories with market leaders (high install concentration):

def market_concentration(group):
    return group.nlargest(1, 'Installs')['Installs'].values[0] / group['Installs'].sum()

category_concentration = df.groupby('Category').apply(_____)
high_concentration_categories = category_concentration[category_concentration > 0.5]
print(high_concentration_categories)
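The concentration measure is the top app's share of all installs in its category. A minimal sketch on made-up numbers, using built-in group aggregations instead of apply:

```python
import pandas as pd

# Made-up install counts for two categories
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Installs": [900, 50, 50, 100, 100],
})

by_category = toy.groupby("Category")["Installs"]

# Largest app's installs divided by the category total
concentration = by_category.max() / by_category.sum()

# Same 0.5 threshold as the exercise
high_concentration = concentration[concentration > 0.5]
print(concentration)
print(high_concentration)
```

Here GAME has a share of 0.9 and is flagged, while TOOLS sits exactly at 0.5 and is not, since the comparison is strict.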

Expected downloads for a new app:

median_downloads = df['Downloads_Per_Day'].median()
print(f"Median downloads per day: {median_downloads:.2f}")

Analyze monetization approaches:

# Compare free vs paid apps
monetization_comparison = df.groupby('Type')['Downloads_Per_Day'].agg(['mean', 'median'])
print(monetization_comparison)

# Analyze impact of in-app purchases
df['Has_In_App_Purchases'] = df['In-app Purchases'].notna()
iap_comparison = df.groupby('Has_In_App_Purchases')['Downloads_Per_Day'].agg(['mean', 'median'])
print(iap_comparison)

Part 5: Data Transformation for Downstream Use

Create a summary DataFrame for each category:

category_summary = df.groupby('Category').agg({
    'App': 'count',
    'Installs': 'sum',
    'Downloads_Per_Day': 'mean',
    'Rating': 'mean',
    'Price_USD': 'mean',
    'Revenue_Per_Download': 'mean'
}).rename(columns={
    'App': 'Total_Apps',
    'Installs': 'Total_Installs',
    'Downloads_Per_Day': 'Avg_Downloads_Per_Day',
    'Rating': 'Avg_Rating',
    'Price_USD': 'Avg_Price',
    'Revenue_Per_Download': 'Avg_Revenue_Per_Download'
})

# Round numeric columns to 2 decimal places
numeric_columns = category_summary.select_dtypes(include=['float64']).columns
category_summary[numeric_columns] = category_summary[numeric_columns].round(2)

print(category_summary.head())
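The aggregate-then-rename pattern can also be written in one step with pandas named aggregation. The sketch below uses a made-up three-row DataFrame and only a subset of the summary columns.

```python
import pandas as pd

# Made-up rows using the exercise's assumed column names
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS"],
    "App": ["a", "b", "c"],
    "Installs": [100, 300, 50],
    "Rating": [4.0, 5.0, 3.0],
})

# Each keyword is the output column name; the tuple is (input column, aggregation)
summary = toy.groupby("Category").agg(
    Total_Apps=("App", "count"),
    Total_Installs=("Installs", "sum"),
    Avg_Rating=("Rating", "mean"),
).round(2)
print(summary)
```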

Create a DataFrame of top apps in each category:

def top_apps_by_category(group):
    return group.nlargest(5, 'Downloads_Per_Day')[['App', 'Downloads_Per_Day', 'Rating', 'Price_USD']]

top_apps = df.groupby('Category').apply(______)
top_apps = top_apps.reset_index(level=1, drop=True).reset_index()
print(top_apps.head(10))
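A top-N-per-group selection can also be done by sorting once and taking the first rows of each group, which avoids the MultiIndex that apply produces. This is an alternative sketch on made-up data, shown with N = 2 for brevity.

```python
import pandas as pd

# Made-up rows using the exercise's assumed column names
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS"],
    "App": ["a", "b", "c", "d"],
    "Downloads_Per_Day": [5.0, 50.0, 20.0, 7.0],
})

top2 = (
    toy.sort_values("Downloads_Per_Day", ascending=False)
    .groupby("Category")
    .head(2)  # keep the first two rows of each group in sorted order
    .sort_values(["Category", "Downloads_Per_Day"], ascending=[True, False])
    .reset_index(drop=True)
)
print(top2)
```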

Part 6: Data Export for Downstream Use

Export processed data to CSV files:

# Export main processed DataFrame
df.to_csv('processed_play_store_data.csv', index=False)

# Export category summary
category_summary.to_csv('category_summary.csv')

# Export top apps
top_apps.to_csv('top_apps_by_category.csv', index=False)

Part 7: Final Analysis and Insights

Generate key insights:

print("Key Insights:")
print(f"1. The category with the highest average downloads per day is: {top_categories.index[0]}")
print(f"2. There are {len(high_concentration_categories)} categories with high market concentration")
print(f"3. The median number of downloads a new app might expect per day is: {median_downloads:.2f}")
print(f"4. {'Paid' if monetization_comparison.loc['Paid', 'mean'] > monetization_comparison.loc['Free', 'mean'] else 'Free'} apps have higher average downloads per day")
print(f"5. Apps {'with' if iap_comparison.loc[True, 'mean'] > iap_comparison.loc[False, 'mean'] else 'without'} in-app purchases tend to have more downloads per day")

This project guides students through a practical data engineering pipeline using the Google Play Store dataset. It covers data ingestion, cleaning, transformation, analysis, and export. The focus is on applying various pandas operations to derive meaningful insights from the data.

Key learning points include:

Handling missing data
Data type conversions
Feature engineering
Grouping and aggregation
Applying custom functions to DataFrames
Creating summary statistics
Exporting processed data for downstream use

akash-coded · 2024-08-27T09:39:56Z

akash-coded
Aug 27, 2024
Maintainer Author

Solution

# -*- coding: utf-8 -*-
"""pandas-project-google-play.ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1V2CxbDr7xg3hHjOx6798gZOSIVrbSLmU

To analyze the download rates by app category using the google-play-store-11-2018.csv dataset, we'll need to use pandas to load and process the data, then perform various analyses to answer the given questions. Let's go through this step-by-step:

# First, let's import the necessary libraries and load the data:
"""

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Load the dataset
df = pd.read_csv(
    "https://raw.githubusercontent.com/schlende/practical-pandas-projects/master/datasets/google-play-store-11-2018.csv"
)

"""# Now, let's analyze each question:

## 1. What categories of applications get a lot of downloads per day?

To answer this, we'll calculate the average daily installs for each category:
"""

# Convert 'released' to datetime and calculate days since release
current_date = datetime.now()
df["released"] = pd.to_datetime(df["released"], errors="coerce")
df["days_since_release"] = (current_date - df["released"]).dt.days

# Calculate daily installs
df["daily_installs"] = df["min_installs"] / df["days_since_release"]

# Now we can proceed with the analysis
category_installs = (
    df.groupby("genre")["daily_installs"].mean().sort_values(ascending=False)
)

print("Top 5 categories with highest daily installs:")
print(category_installs.head())

plt.figure(figsize=(12, 6))
sns.barplot(x=category_installs.index[:10], y=category_installs.values[:10])
plt.title('Top 10 Categories by Average Daily Installs')
plt.xlabel('Category')
plt.ylabel('Average Daily Installs')
plt.xticks(rotation=45)
plt.show()

"""The above code calculates the daily installs for each app, then groups by genre to find the average daily installs per category. We then visualize the top 10 categories.

## 2. What categories of applications don't get many downloads per day?

We can use the same category_installs Series from the previous question, but look at the bottom categories:
"""

print("\nBottom 5 categories with lowest daily installs:")
print(category_installs.tail())

plt.figure(figsize=(12, 6))
sns.barplot(x=category_installs.index[-10:], y=category_installs.values[-10:])
plt.title('Bottom 10 Categories by Average Daily Installs')
plt.xlabel('Category')
plt.ylabel('Average Daily Installs')
plt.xticks(rotation=45)
plt.show()

"""## 3. In what app categories are there market leaders?

To identify market leaders, we can look at the categories where the top app has significantly more downloads than the average:
"""

def market_leader_ratio(group):
    top_app = group.nlargest(1, 'min_installs')['min_installs'].values[0]
    avg_installs = group['min_installs'].mean()
    return top_app / avg_installs

market_leaders = df.groupby('genre').apply(market_leader_ratio).sort_values(ascending=False)

print("\nTop 5 categories with strong market leaders:")
print(market_leaders.head())

"""The above code calculates the ratio of the top app's installs to the average installs in each category. A high ratio indicates a strong market leader.

## 4. How many downloads per day might you expect if you took the time to build an app?

To estimate this, we can look at the median daily installs across all apps:
"""

median_daily_installs = df['daily_installs'].median()
print(f"\nMedian daily installs: {median_daily_installs:.2f}")

"""This gives us a realistic expectation for a typical app.

## 5. What can the data tell you about monetization approaches?
We can analyze the relationship between pricing, in-app purchases, ad support, and app popularity:
"""

# Compare free vs paid apps
df['is_paid'] = df['price'] > 0
monetization_comparison = df.groupby('is_paid')['daily_installs'].median()

print("\nMedian daily installs:")
print(f"Free apps: {monetization_comparison[False]:.2f}")
print(f"Paid apps: {monetization_comparison[True]:.2f}")

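# Analyze in-app purchases and ad support
# Later assignments override earlier ones, so an app that is both paid and ad supported ends up labelled 'Ad Supported'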
df['monetization'] = 'Free'
df.loc[df['price'] > 0, 'monetization'] = 'Paid'
df.loc[df['offers_iap'] == True, 'monetization'] = 'IAP'
df.loc[df['ad_supported'] == True, 'monetization'] = 'Ad Supported'

monetization_installs = df.groupby('monetization')['daily_installs'].median().sort_values(ascending=False)

print("\nMedian daily installs by monetization strategy:")
print(monetization_installs)

plt.figure(figsize=(10, 6))
sns.boxplot(x='monetization', y='daily_installs', data=df)
plt.title('Daily Installs by Monetization Strategy')
plt.ylabel('Daily Installs (log scale)')
plt.yscale('log')
plt.xticks(rotation=45)
plt.show()

# Analyze the relationship between price and daily installs for paid apps
paid_apps = df[df['price'] > 0]
plt.figure(figsize=(10, 6))
plt.scatter(paid_apps['price'], paid_apps['daily_installs'])
plt.title('Price vs Daily Installs for Paid Apps')
plt.xlabel('Price ($)')
plt.ylabel('Daily Installs (log scale)')
plt.yscale('log')
plt.show()

# Analyze the relationship between rating and daily installs
plt.figure(figsize=(10, 6))
plt.scatter(df['score'], df['daily_installs'])
plt.title('Rating vs Daily Installs')
plt.xlabel('Rating')
plt.ylabel('Daily Installs (log scale)')
plt.yscale('log')
plt.show()

# Analyze the top-grossing apps
df['estimated_revenue'] = df['daily_installs'] * df['price']
top_grossing = df.nlargest(10, 'estimated_revenue')
print("\nTop 10 Grossing Apps:")
print(top_grossing[['title', 'genre', 'price', 'daily_installs', 'estimated_revenue']])
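One detail worth checking in the solution above: daily_installs divides by days_since_release, so an app released on the day of the analysis, or one whose release date failed to parse, would produce an infinite or missing value. The sketch below shows one possible guard on made-up numbers. The column names follow the solution, and treating non-positive day counts as missing is an assumption, not something the solution does.

```python
import numpy as np
import pandas as pd

# Made-up rows: a normal app, one released "today", and one with an unparsed date
apps = pd.DataFrame({
    "min_installs": [1000, 500, 200],
    "days_since_release": [100, 0, np.nan],
})

# Keep only positive day counts; everything else becomes NaN
valid_days = apps["days_since_release"].where(apps["days_since_release"] > 0)

# Dividing by NaN yields NaN rather than inf, and median() skips NaN by default
apps["daily_installs"] = apps["min_installs"] / valid_days
print(apps)
print(apps["daily_installs"].median())
```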
