Hands-on Implementation - DE with Python and Pandas - Google Play Store App Analysis Pipeline #45
akash-coded started this conversation in Tasks
Solution
# -*- coding: utf-8 -*-
"""pandas-project-google-play.ipynb
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1V2CxbDr7xg3hHjOx6798gZOSIVrbSLmU
To analyze download rates by app category in the google-play-store-11-2018.csv dataset, we load the data with pandas and then work through each question in turn. Let's go through this step by step:
# First, let's import the necessary libraries and load the data:
"""
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
# Load the dataset
df = pd.read_csv(
"https://raw.githubusercontent.com/schlende/practical-pandas-projects/master/datasets/google-play-store-11-2018.csv"
)
"""# Now, let's analyze each question:
## 1. What categories of applications get a lot of downloads per day?
To answer this, we'll calculate the average daily installs for each category:
"""
# Convert 'released' to datetime; values that cannot be parsed become NaT
current_date = datetime.now()
df["released"] = pd.to_datetime(df["released"], errors="coerce")
# Age of each app in days, measured from today's date rather than from the date of the data snapshot
df["days_since_release"] = (current_date - df["released"]).dt.days
# Calculate daily installs as min_installs averaged over the app's age
df["daily_installs"] = df["min_installs"] / df["days_since_release"]
# Now we can proceed with the analysis
category_installs = (
df.groupby("genre")["daily_installs"].mean().sort_values(ascending=False)
)
print("Top 5 categories with highest daily installs:")
print(category_installs.head())
plt.figure(figsize=(12, 6))
sns.barplot(x=category_installs.index[:10], y=category_installs.values[:10])
plt.title('Top 10 Categories by Average Daily Installs')
plt.xlabel('Category')
plt.ylabel('Average Daily Installs')
plt.xticks(rotation=45)
plt.show()
"""The above code calculates the daily installs for each app, then groups by genre to find the average daily installs per category. We then visualize the top 10 categories.
## 2. What categories of applications don't get many downloads per day?
We can reuse the category_installs Series from the previous question, which is sorted in descending order, and look at its last entries:
"""
print("\nBottom 5 categories with lowest daily installs:")
print(category_installs.tail())
plt.figure(figsize=(12, 6))
sns.barplot(x=category_installs.index[-10:], y=category_installs.values[-10:])
plt.title('Bottom 10 Categories by Average Daily Installs')
plt.xlabel('Category')
plt.ylabel('Average Daily Installs')
plt.xticks(rotation=45)
plt.show()
"""## 3. In what app categories are there market leaders?
To identify market leaders, we can look at the categories where the top app has significantly more downloads than the average:
"""
def market_leader_ratio(group):
    # Ratio of the most-installed app's installs to the category average
    top_app = group.nlargest(1, 'min_installs')['min_installs'].values[0]
    avg_installs = group['min_installs'].mean()
    return top_app / avg_installs


market_leaders = df.groupby('genre').apply(market_leader_ratio).sort_values(ascending=False)
print("\nTop 5 categories with strong market leaders:")
print(market_leaders.head())
"""The above code calculates the ratio of the top app's installs to the average installs in each category. A high ratio indicates a strong market leader.
## 4. How many downloads per day might you expect if you took the time to build an app?
To estimate this, we can look at the median daily installs across all apps:
"""
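median_daily_installs = df['daily_installs'].median()
print(f"\nMedian daily installs: {median_daily_installs:.2f}")
"""This gives a realistic expectation for a typical app. The median is used instead of the mean because a few extremely popular apps would otherwise dominate the estimate.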
## 5. What can the data tell you about monetization approaches?
We can analyze the relationship between pricing, in-app purchases, ad support, and app popularity:
"""
# Compare free vs paid apps
df['is_paid'] = df['price'] > 0
monetization_comparison = df.groupby('is_paid')['daily_installs'].median()
print("\nMedian daily installs:")
print(f"Free apps: {monetization_comparison[False]:.2f}")
print(f"Paid apps: {monetization_comparison[True]:.2f}")
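# Analyze in-app purchases and ad support
# Each app gets exactly one label. Later assignments overwrite earlier ones,
# so an ad-supported app is labeled 'Ad Supported' even if it is also paid or offers IAP.
df['monetization'] = 'Free'
df.loc[df['price'] > 0, 'monetization'] = 'Paid'
df.loc[df['offers_iap'] == True, 'monetization'] = 'IAP'
df.loc[df['ad_supported'] == True, 'monetization'] = 'Ad Supported'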
monetization_installs = df.groupby('monetization')['daily_installs'].median().sort_values(ascending=False)
print("\nMedian daily installs by monetization strategy:")
print(monetization_installs)
plt.figure(figsize=(10, 6))
sns.boxplot(x='monetization', y='daily_installs', data=df)
plt.title('Daily Installs by Monetization Strategy')
plt.ylabel('Daily Installs (log scale)')
plt.yscale('log')
plt.xticks(rotation=45)
plt.show()
# Analyze the relationship between price and daily installs for paid apps
paid_apps = df[df['price'] > 0]
plt.figure(figsize=(10, 6))
plt.scatter(paid_apps['price'], paid_apps['daily_installs'])
plt.title('Price vs Daily Installs for Paid Apps')
plt.xlabel('Price ($)')
plt.ylabel('Daily Installs (log scale)')
plt.yscale('log')
plt.show()
# Analyze the relationship between rating and daily installs
plt.figure(figsize=(10, 6))
plt.scatter(df['score'], df['daily_installs'])
plt.title('Rating vs Daily Installs')
plt.xlabel('Rating')
plt.ylabel('Daily Installs (log scale)')
plt.yscale('log')
plt.show()
# Rank apps by a rough daily revenue proxy: price times daily installs
# (purchase price only; in-app purchase and ad income are not captured)
df['estimated_revenue'] = df['daily_installs'] * df['price']
top_grossing = df.nlargest(10, 'estimated_revenue')
print("\nTop 10 Grossing Apps:")
print(top_grossing[['title', 'genre', 'price', 'daily_installs', 'estimated_revenue']])
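The daily-install figures in the script divide by each app's age measured from today's date, so the numbers shrink every time the notebook is rerun. Here is a minimal sketch of one alternative, assuming the file is a snapshot taken around the end of November 2018. The `SNAPSHOT_DATE` value is a guess based only on the file name, the helper name `add_daily_installs` is hypothetical, and the three rows are made-up illustration data that reuse the script's column names. Ages of zero or less are masked so they cannot cause a division by zero.

```python
import pandas as pd

# Assumed snapshot date, inferred only from "11-2018" in the file name
SNAPSHOT_DATE = pd.Timestamp("2018-11-30")


def add_daily_installs(frame, as_of=SNAPSHOT_DATE):
    """Return a copy with days_since_release and daily_installs columns."""
    out = frame.copy()
    out["released"] = pd.to_datetime(out["released"], errors="coerce")
    days = (as_of - out["released"]).dt.days
    # Keep only positive ages; zero, negative, and missing ages become NaN
    out["days_since_release"] = days.where(days > 0)
    out["daily_installs"] = out["min_installs"] / out["days_since_release"]
    return out


# Made-up rows for illustration only
sample = pd.DataFrame({
    "title": ["A", "B", "C"],
    "released": ["2018-11-20", "2018-11-30", "not a date"],
    "min_installs": [1000, 500, 100],
})
result = add_daily_installs(sample)
print(result[["title", "days_since_release", "daily_installs"]])
```

Apps released on the snapshot date or with an unparseable release date end up with a missing `daily_installs` value, and pandas skips missing values by default in `mean()` and `median()`.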
This is a hands-on implementation scenario based on the Google Play Store dataset, focusing on practical data engineering tasks. This project will guide you through data ingestion, transformation, and analysis using pandas.
Project: Google Play Store App Analysis Pipeline
Scenario:
You're a data engineer at a mobile app development startup. Your company wants to understand the Google Play Store market to make informed decisions about their next app. You've been tasked with building a data pipeline to ingest, process, and analyze the Google Play Store dataset.
Dataset: https://github.com/schlende/practical-pandas-projects/blob/master/datasets/google-play-store-11-2018.csv
Play Store Apps dataset - Why is this dataset interesting?
The Google Play Store is Google's app marketplace. Most people use the Google Play Store when they want to install new apps on their Android phones.
Like any market, apps in the Play Store are subject to supply and demand. That is to say, certain kinds of apps get downloaded a lot while others don't. Certain kinds of apps get paid for while others don't. Some categories of apps have lots of competition while others don't.
A dataset like this can help you spot opportunities.
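Ideas for questions this data can help you answer:
1. What categories of applications get a lot of downloads per day?
2. What categories of applications don't get many downloads per day?
3. In what app categories are there market leaders?
4. How many downloads per day might you expect if you took the time to build an app?
5. What can the data tell you about monetization approaches?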
Creating the Pipeline
Part 1: Data Ingestion and Initial Exploration
Part 2: Data Transformation
Part 3: Analysis and Business Logic Application
Part 4: Answering Business Questions
Part 5: Data Transformation for Downstream Use
Part 6: Data Export for Downstream Use
Part 7: Final Analysis and Insights
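The parts listed above can be sketched as a chain of small functions. This is only an illustrative skeleton: the function names (`ingest`, `clean`, `summarize`, `export`) are hypothetical, their mapping to the numbered parts is a loose guess because the details of each part are not shown here, the column names are borrowed from the solution script, and a tiny made-up in-memory CSV stands in for the real file.

```python
import io

import pandas as pd

# Tiny stand-in for the real CSV; all values are made up for illustration
RAW_CSV = """title,genre,min_installs,price,score
App A,Tools,1000,0.0,4.5
App B,Tools,50,1.99,3.9
App B,Tools,50,1.99,3.9
App C,Puzzle,,0.0,4.1
App D,Puzzle,200000,0.0,
"""


def ingest(source):
    """Ingestion: load the raw data and report its shape."""
    frame = pd.read_csv(source)
    print(f"Loaded {frame.shape[0]} rows, {frame.shape[1]} columns")
    return frame


def clean(frame):
    """Transformation: drop duplicate rows and rows with no install count."""
    out = frame.drop_duplicates().dropna(subset=["min_installs"]).copy()
    out["min_installs"] = out["min_installs"].astype("int64")
    out["is_paid"] = out["price"] > 0
    return out


def summarize(frame):
    """Analysis: aggregate per genre to support the business questions."""
    return (
        frame.groupby("genre")
        .agg(apps=("title", "count"), median_installs=("min_installs", "median"))
        .reset_index()
    )


def export(frame, target):
    """Export: write the summary as CSV for downstream consumers."""
    frame.to_csv(target, index=False)


summary = summarize(clean(ingest(io.StringIO(RAW_CSV))))
buffer = io.StringIO()
export(summary, buffer)
print(buffer.getvalue())
```

Keeping each stage as a function that takes and returns a DataFrame makes it easy to test stages in isolation and to swap the in-memory buffers for real file paths or URLs later.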
This project guides students through a practical data engineering pipeline using the Google Play Store dataset. It covers data ingestion, cleaning, transformation, analysis, and export. The focus is on applying various pandas operations to derive meaningful insights from the data.
Key learning points include: