# **Project Name**    -  Google Play Store App Review Exploratory Data Analysis

![Google Play Store](https://gagadget.com/media/post_big/what-is-google-play-hero_2.jpg)


##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

Google Play Store is one of the largest mobile app distribution platforms, hosting millions of applications across diverse categories such as Games, Education, Business, Health, and Lifestyle. With immense competition among apps, factors such as user reviews, ratings, installs, and category placement significantly impact an app’s success.

This project is based on a dataset containing 122,662 records across 17 attributes. The dataset provides valuable information such as app details, category, rating, number of reviews, size, installs, price, content rating, genres, update history, and user sentiment (translated reviews, sentiment polarity, and subjectivity).

Initial analysis highlights an important insight — while millions of apps are available, only a small fraction achieve consistently high ratings and positive sentiments, indicating the presence of hidden patterns that determine app popularity and user satisfaction.

The goal of this project is to uncover these patterns and generate actionable insights that can help app developers, marketers, and businesses optimize their strategies, enhance user experience, and increase downloads and engagement.

### ✅ Approach Used
#### 1) Loading the Data :-
Loaded the dataset in Google Colab.
Shape of dataset: (122,662, 17).

#### 2) Data Cleaning and Processing:-
Removed duplicate records and irrelevant entries.
Handled missing values in Rating, Size, Sentiment columns.

#### 3) Analysis and Visualization:-

We explored important attributes such as Category, Rating, Reviews, Installs, Price, Content Rating, and Sentiment to identify patterns and trends. Key analyses include:

* Distribution of app ratings and reviews.

* Most popular categories and genres by number of apps.

* Free vs. Paid apps → install trends and revenue potential.

* Relationship between app size and rating.

* Sentiment analysis of user reviews → polarity & subjectivity trends.

* Effect of last updated date on app performance.

#### 4) Future Scope of Further Analysis:-

* Identifying category-wise trends in high-rated apps.

* Analyzing revenue contribution of paid apps across categories.

* Studying seasonal trends in app releases and updates.

* Deeper sentiment analysis (positive/negative reviews by category).

* Building a predictive model to estimate an app’s success based on reviews, ratings, and category.

### 📊 Types of Graphs Used for Data Visualization

* Histogram (with KDE)
* Count Plot
* Bar Plot
* Line Chart
* Scatter Plot
* Heatmap
* Box Plot
* Pie Chart

### 🛠 Python Libraries Used

* Pandas → Data cleaning & analysis

* NumPy → Numerical operations

* Matplotlib → Visualizations

* Seaborn → Statistical plots

# **GitHub Link -**

https://github.com/YUVRAJSONDHIYA/Playstore_App_Reviews_Exploratory_data_Analysis

# **Problem Statement**


The Google Play Store hosts millions of mobile applications across diverse categories such as Games, Business, Education, and Lifestyle. With this massive competition, the success of an app depends on multiple factors — its category, ratings, user reviews, size, price model (free/paid), and regular updates.

Despite the abundance of apps, only a small fraction achieve high ratings and strong user sentiment, while many struggle to retain users. This creates important challenges for developers and businesses:

1. Which categories and genres dominate the Play Store, and how do they impact installs?

2. How do ratings and user reviews influence an app’s popularity and success?

3. What role does pricing (free vs paid) play in driving downloads and engagement?

4. How does user sentiment (positive, negative, neutral) from reviews correlate with app ratings and installs?

5. Do regular updates improve app ratings and user satisfaction over time?

By analyzing this dataset of 122,662 records and 17 attributes, the goal is to uncover hidden patterns behind app success and user satisfaction. The insights generated can help app developers, marketers, and businesses optimize their strategies, improve user experience, and build more successful applications in a highly competitive market.

#### **Define Your Business Objective?**

The objective of this project is to analyze Google Play Store app data to identify the factors that contribute to an app’s success and user satisfaction. By leveraging insights from app details, user ratings, reviews, installs, and sentiments, this analysis will help:

* Identify Top-Performing Categories & Genres → Determine which types of apps dominate the market and attract the highest installs.

* Understand the Impact of Ratings & Reviews → Evaluate how user feedback influences app popularity and long-term success.

* Compare Free vs Paid Apps → Assess how pricing models affect downloads, user sentiment, and potential revenue.

* Analyze Sentiment Trends → Study user reviews (positive, negative, neutral) to understand key drivers of satisfaction and dissatisfaction.

* Evaluate the Effect of App Size & Updates → Check whether app size and regular updates influence ratings and user adoption.

* Provide Actionable Recommendations → Enable app developers, marketers, and businesses to make data-driven decisions for improving app quality, enhancing user engagement, and increasing competitiveness in the Play Store ecosystem.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is followed for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

# Load the 'Play Store Data' dataset from the CSV file
play_store_data = pd.read_csv("/content/Play Store Data.csv")

# Load the 'User Reviews' dataset from the  CSV file
user_reviews = pd.read_csv("/content/User Reviews.csv")

# Merge the two datasets on the common column 'App'
# Perform an inner join to keep only the rows where 'App' exists in both DataFrames
data = pd.merge(play_store_data,user_reviews,on='App',how='inner')
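
Because the reviews file holds many rows per app, an inner join on `App` repeats each app's store metadata once per matching review. The following is a minimal sketch on invented toy frames (`apps` and `reviews` are hypothetical stand-ins, not the real CSV files) that shows this effect and how the `validate` argument of `pd.merge` can assert the expected one-to-many relationship:

```python
import pandas as pd

# Toy stand-ins for the two CSV files (values are invented for illustration)
apps = pd.DataFrame({"App": ["A", "B", "C"], "Category": ["GAME", "TOOLS", "GAME"]})
reviews = pd.DataFrame({
    "App": ["A", "A", "B", "D"],
    "Sentiment": ["Positive", "Negative", "Positive", "Neutral"],
})

# Inner join keeps only apps present in both frames. App 'A' appears twice
# because it has two reviews. validate="one_to_many" raises a MergeError
# if 'App' is not unique in the left frame.
merged = pd.merge(apps, reviews, on="App", how="inner", validate="one_to_many")

# Number of merged rows contributed by each app
rows_per_app = merged.groupby("App").size()

print(merged.shape)   # rows equal the number of matched reviews, not the number of apps
print(rows_per_app)
```

On the real data, `validate` may raise if the store file contains repeated app names, which would be worth knowing before interpreting row counts as app counts.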

### Dataset First View

In [None]:
# Dataset First Look
data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data.isnull().sum()

In [None]:
(data.isnull().sum()/len(data))*100

In [None]:
# Visualizing the missing values
plt.figure(figsize=(12, 6))
sns.heatmap(data.isnull(),
            cmap='rocket',  # better color scheme
            cbar=False,
            yticklabels=False)  # hides row numbers for cleaner view
plt.title('Heatmap of Missing Values', fontsize=15)
plt.xlabel('Columns')
plt.tight_layout()
plt.show()
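
The count and percentage checks above can also be combined into a single table. This is only a sketch: `missing_summary` is a hypothetical helper name and the toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    percent = (counts / len(frame) * 100).round(2)
    summary = pd.DataFrame({"missing_count": counts, "missing_percent": percent})
    # Keep only the columns that actually contain gaps
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )

# Invented example data
toy = pd.DataFrame({
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Size": ["10M", "5M", None, "2M"],
    "App": list("abcd"),
})
print(missing_summary(toy))
```

In the notebook this would be called as `missing_summary(data)`.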

### What did you know about your dataset?

The Google Play Store App Review dataset contains 122,662 records and 17 attributes, providing detailed information about mobile applications and their user feedback. It includes app-related details such as name, category, rating, reviews, size, installs, price, content rating, genres, and update history. Alongside these, it also captures translated user reviews, sentiment labels (positive, negative, neutral), and sentiment polarity/subjectivity scores, making it useful for both exploratory data analysis (EDA) and sentiment analysis. This dataset enables us to explore factors driving app popularity, user satisfaction, and market trends across different categories and genres.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe(include='all')

### Variables Description

* App : Name of the application.

* Category : Category of the app (e.g., Game, Business, Education, Lifestyle).

* Rating : Average user rating of the app (scale of 1–5).

* Reviews : Total number of reviews submitted for the app.

* Size : Size of the app in MB/KB (or “Varies with device”).

* Installs : Number of installs/downloads (e.g., 10,000+, 1,000,000+).

* Type : Specifies whether the app is Free or Paid.

* Price : Price of the app (0 for free apps, numeric value for paid apps).

* Content Rating : Age group suitability (e.g., Everyone, Teen, Mature 17+).

* Genres : Genre(s) of the app (e.g., Action, Puzzle, Lifestyle; multiple possible).

* Last Updated : Date when the app was last updated on the Play Store.

* Current Ver : Current version of the app available.

* Android Ver : Minimum Android version required to run the app.

* Translated_Review : User review translated into English.

* Sentiment : Sentiment label of the review (Positive, Negative, Neutral).

* Sentiment_Polarity : Polarity score (-1 to +1) showing how positive/negative a review is.

* Sentiment_Subjectivity : Subjectivity score (0 = objective, 1 = highly subjective opinion).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# Loop through each column in the DataFrame and print the number of unique values for that column
for col in data.columns:
    print(f"No. of unique values in {col}: {data[col].nunique()}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Creating a copy of the current dataset and assigning to df
df = data.copy()

In [None]:
# Analysing the null values
df.isnull().sum()

In [None]:
# Analysing the null values in percentage
(df.isnull().sum()/len(df))*100

#### Filling the Missing Values

In [None]:
# Check the skewness of the 'Rating' column to decide on an appropriate imputation method
skewness = df['Rating'].skew()  # Negative skew indicates a left-skewed distribution
print("Skewness of Rating:", skewness)

# Fill missing values in 'Rating' with the median, which is robust to skewed distributions
rating_median = df["Rating"].median()
df.fillna({"Rating": rating_median}, inplace=True)
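
The reasoning behind this step (use the median when the distribution is skewed) can be written as an explicit rule. This is a rough sketch: `impute_by_skew` is a hypothetical helper, and the 0.5 cut-off is a common rule of thumb assumed here, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

def impute_by_skew(series: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Fill NaNs with the median if |skew| exceeds the threshold, else with the mean."""
    skew = series.skew()
    # threshold=0.5 is an assumed rule of thumb for "noticeably skewed"
    fill_value = series.median() if abs(skew) > threshold else series.mean()
    return series.fillna(fill_value)

# Invented ratings with one low outlier, which makes the data left-skewed
ratings = pd.Series([4.5, 4.4, 4.6, 4.3, 4.5, 2.0, np.nan])
filled = impute_by_skew(ratings)
print(ratings.skew(), filled.iloc[-1])
```

The median is preferred for skewed data because a few extreme values pull the mean away from the typical rating.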

In [None]:
# Dropping rows with missing values in review/sentiment-related columns
# because these features are critical for EDA and cannot be imputed reliably.
# Keeping them would bias our sentiment analysis, so we ensure data quality by removing incomplete records.
# Drop rows where review or sentiment information is missing
df.dropna(
    subset=["Translated_Review", "Sentiment", "Sentiment_Polarity", "Sentiment_Subjectivity"],
    inplace=True
)

# Drop duplicate rows if any
df.drop_duplicates(inplace=True)

# Reset index after cleaning
df.reset_index(drop=True, inplace=True)

#### Why Imputation Is Not Used for Review & Sentiment Columns

1. Nature of Data:-

The missing values are in Translated_Review, Sentiment, Sentiment_Polarity, and Sentiment_Subjectivity. These columns contain text data and sentiment scores derived from that text.

2. Imputation Limitations:-

Unlike numeric columns (where mean/median can represent missing values), textual reviews and sentiments cannot be “guessed” without introducing bias.

Example: Filling a missing review with a placeholder or repeating another review would distort the analysis.

3. Impact on EDA:-

EDA relies on the authentic distribution of sentiments and reviews. Artificially filled values would misrepresent audience opinions. Sentiment scores are mathematically linked to the original review text. If the review is missing, imputing polarity/subjectivity is meaningless.

4. Data Integrity:-

Dropping rows with missing values ensures higher data quality and integrity. This avoids introducing noise or false patterns in downstream analysis.

In [None]:
# Check duplicate values
df.duplicated().sum()

In [None]:
df.info()

In [None]:
# The 'Installs' column contains strings with '+' and ',' (e.g., 500,000+).
# These are non-numeric characters that prevent mathematical analysis.
# We remove '+' and ',' to clean the data, then convert the column into integers
df['Installs'] = df['Installs'].str.replace('[+,]', '', regex=True).astype(int)
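
A direct `.astype(int)` fails if any value is still non-numeric after stripping. A more defensive variant is sketched below. `clean_installs` is a hypothetical helper, and the stray `'Free'` entry is an invented example of a malformed value.

```python
import pandas as pd

def clean_installs(raw: pd.Series) -> pd.Series:
    """Strip '+' and ',' and convert to a nullable integer column."""
    stripped = raw.astype(str).str.replace(r"[+,]", "", regex=True)
    # errors="coerce" turns anything non-numeric into NaN instead of raising
    return pd.to_numeric(stripped, errors="coerce").astype("Int64")

# Invented sample, including one malformed entry
sample = pd.Series(["500,000+", "1,000+", "Free", "0"])
cleaned = clean_installs(sample)
print(cleaned)
```

The nullable `Int64` dtype keeps the column integer-typed while still allowing missing values, which can then be inspected or dropped explicitly.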

In [None]:
# The 'Reviews' column is stored as object (string) even though it represents numbers.
# Converting it to integer type allows proper numerical operations (e.g., sorting, aggregations).
df['Reviews'] = df['Reviews'].astype(int)

In [None]:
# Convert the 'Price' column from string (with '$' sign) to float
# Step 1: Ensure everything is string
df['Price'] = df['Price'].astype(str)

# Step 2: Remove currency symbols, spaces, commas etc.
df['Price'] = df['Price'].str.replace(r'[^0-9.]', '', regex=True)

# Step 3: Replace empty strings with NaN
df['Price'] = df['Price'].replace('', np.nan)

# Step 4: Convert to float safely
df['Price'] = df['Price'].astype(float)

In [None]:
# Step 1: Drop rows where Size is 'Varies with device' or missing
df = df[df['Size'].notna() & (df['Size'] != 'Varies with device')].copy()

# Step 2: Convert remaining Size values to float (MB)
def size_to_mb(size):
    size = str(size).strip()
    if 'M' in size:
        return float(size.replace('M',''))
    elif 'k' in size or 'K' in size:
        return float(size.replace('k','').replace('K','')) / 1024
    else:
        return float(size)  # already numeric

df['Size'] = df['Size'].apply(size_to_mb)

# Step 3: Ensure dtype is float
df['Size'] = df['Size'].astype(float)
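
The same conversion can be done without a row-by-row `apply`. The sketch below is one possible vectorized version. `size_series_to_mb` is a hypothetical name, and unlike the cell above it keeps unparseable entries such as 'Varies with device' as NaN instead of dropping them first.

```python
import pandas as pd

def size_series_to_mb(raw: pd.Series) -> pd.Series:
    """Convert strings like '19M' or '201k' to megabytes; anything else becomes NaN."""
    text = raw.astype(str).str.strip()
    # Everything except the last character should be the numeric part
    number = pd.to_numeric(text.str[:-1], errors="coerce")
    # The last character is the unit; kilobytes are divided by 1024
    unit = text.str[-1].str.upper()
    factor = unit.map({"M": 1.0, "K": 1.0 / 1024})
    return number * factor

# Invented sample values
sizes = pd.Series(["19M", "512k", "Varies with device", None])
size_mb = size_series_to_mb(sizes)
print(size_mb)
```

Whether to drop or keep the NaN rows afterwards is a separate decision. This notebook drops them.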

In [None]:
# Convert the 'Last Updated' column from string/object type to datetime format for easier time-based analysis
df['Last Updated'] = pd.to_datetime(df['Last Updated'])

In [None]:
# Convert Sentiment columns to float
df['Sentiment_Polarity'] = df['Sentiment_Polarity'].astype(float)
df['Sentiment_Subjectivity'] = df['Sentiment_Subjectivity'].astype(float)

In [None]:
# Extract Year and Month from Last Updated
df['Last_Updated_Year'] = df['Last Updated'].dt.year
df['Last_Updated_Month'] = df['Last Updated'].dt.month
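
If the date strings follow a single known pattern, passing an explicit `format` makes parsing stricter and turns surprises into NaT rather than silent misreads. The sketch below assumes dates look like "January 7, 2018". That pattern is an assumption about the file and should be checked against the actual column.

```python
import pandas as pd

# Invented sample, including one malformed entry
dates = pd.Series(["January 7, 2018", "August 1, 2018", "not a date"])

# Assumed pattern: full month name, day, four-digit year.
# errors="coerce" converts unparseable strings to NaT instead of raising.
parsed = pd.to_datetime(dates, format="%B %d, %Y", errors="coerce")

year = parsed.dt.year
month = parsed.dt.month
print(parsed)
```

Counting `parsed.isna().sum()` afterwards gives a quick check of how many rows did not match the expected pattern.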

### What all manipulations have you done and insights you found?

### 1. Created a Copy of Dataset

Made a copy of the original dataset (df = data.copy()) so that the raw data remains safe for future reference.

### 2. Handling Missing Values

* Rating → Filled missing values with median (chosen after checking skewness).

* Review & Sentiment-related columns (Translated_Review, Sentiment, Sentiment_Polarity, Sentiment_Subjectivity) → Rows with missing values were dropped because imputing text/sentiment data is not meaningful and could bias sentiment analysis.

* Ensured dataset completeness by removing incomplete records.

### 3. Removing Duplicates & Resetting Index

* Dropped duplicate rows to avoid bias in analysis.

* Reset the index so the dataset remains continuous after row deletions.

### 4. Data Type Conversions

* Installs → Removed non-numeric characters (+ and ,) and converted to integer.

* Reviews → Converted from string to integer.

* Price → Removed $ and other non-numeric characters, then converted the cleaned column to float for numerical analysis.

* Size → Dropped rows with 'Varies with device' and converted remaining values to float (MB) for numeric analysis.

* Last Updated → Converted from string to datetime format.

* Sentiment_Polarity & Sentiment_Subjectivity → Converted to float.

### 5. Feature Engineering

* Extracted Last_Updated_Year and Last_Updated_Month from the Last Updated column for time-based analysis.

### 📊 Insights

* Missing Data: Rating was safely filled using median; sentiment-related missing rows were dropped to preserve analysis quality.

* Data Consistency: All numeric columns (Installs, Reviews, Price, Size) are now in correct numerical types, enabling aggregation and visualization.

* Time Features: Extracted year & month allow trend analysis of app updates over time.

* Bias Reduction: Removing duplicates & incomplete sentiment rows ensures more reliable EDA results.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# Set up the figure size for better visualization
plt.figure(figsize=(8,6))

# Plot the distribution of ratings as a histogram with a KDE (Kernel Density Estimate) overlay
sns.histplot(df['Rating'], bins=24, kde=True, color='red')

# Add title and axis labels
plt.title("Distribution of App Ratings", fontsize=14, weight='bold')
plt.xlabel("App Rating")
plt.ylabel("Frequency")

# Add a grid for better readability of values
plt.grid(True)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a Histogram with KDE because it clearly shows how app ratings are distributed. It helps identify rating patterns, central tendency, and skewness at a glance.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that most apps on the Play Store receive high ratings (above 4.0), with a strong peak around 4.2–4.6. The distribution is left-skewed, indicating fewer low-rated apps. Very few apps fall below 3.5, suggesting overall positive sentiment and user satisfaction with most apps.
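
The visual reading above can be backed by a few numbers. This is a minimal sketch on invented ratings: `rating_summary` is a hypothetical helper, and the 4.0 and 3.5 thresholds simply mirror the cut-offs mentioned in the text.

```python
import pandas as pd

def rating_summary(ratings: pd.Series, high: float = 4.0, low: float = 3.5) -> dict:
    """Summarise skewness and the share of high and low ratings."""
    ratings = ratings.dropna()
    return {
        "skewness": ratings.skew(),              # negative means left-skewed
        "median": ratings.median(),
        "share_high": (ratings >= high).mean(),  # fraction rated >= 4.0
        "share_low": (ratings < low).mean(),     # fraction rated < 3.5
    }

# Invented example ratings
toy_ratings = pd.Series([4.5, 4.3, 4.6, 4.4, 4.2, 3.0, 4.7, 4.1])
summary = rating_summary(toy_ratings)
print(summary)
```

Since the merged frame repeats each app's rating once per review, one option is to pass `df.drop_duplicates("App")["Rating"]` so that every app counts once.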

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can help create a positive business impact because the concentration of high ratings (above 4.0) reflects strong user satisfaction and trust, which is crucial for app downloads and retention.

However, the presence of a small portion of low-rated apps (<3.5) may lead to negative growth for those apps, as poor ratings can discourage new users and reduce visibility in the Play Store ranking. Hence, businesses should focus on improving user experience and fixing issues in low-rated apps to avoid negative impact.

#### Chart - 2

In [None]:
# Chart - 2 visualization code

# Set up the figure size for better visualization
plt.figure(figsize=(15,6))

# Count the number of apps in each category
category_counts = df['Category'].value_counts()

# Plot a bar chart of how apps are distributed across categories
sns.barplot(x=category_counts.index, y=category_counts.values, color='red')

# Chart title and labels
plt.title("Distribution of Apps Across Categories", fontsize=16,weight='bold')
plt.xlabel("App Categories", fontsize=12)
plt.ylabel("Number of Apps", fontsize=12)
plt.xticks(rotation=75)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart because it is the most effective way to compare apps across different categories. It clearly highlights which categories have the highest or lowest number of apps, making the distribution easy to understand at a glance.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart are:

* Games dominate the Play Store, reflecting high user demand and developer focus.

* Lifestyle-related apps (Family, Health & Fitness, Travel, Tools) are also widely developed, showing their importance in daily life.

* There is a broad variety of app categories, from entertainment to niche areas, highlighting a diverse app ecosystem.

* Niche categories (Comics, Events, Maps & Navigation) have fewer apps, indicating limited user engagement or developer interest.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights will help create a positive business impact as they highlight where user demand is strongest (e.g., Games, Family, Health & Fitness), guiding developers and businesses to focus on high-growth categories.

On the other hand, niche categories with fewer apps (like Comics, Events, or Parenting) may show slower growth or limited demand, which could lead to negative business impact if companies over-invest in them without proper market research.
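
One caveat is worth checking: `df` is the merged frame, so `value_counts()` on `Category` counts review rows rather than distinct apps. The sketch below, on invented toy data, contrasts the two counts. Whether the ranking changes on the real data is not known here.

```python
import pandas as pd

# Invented toy data: app 'A' has three reviews, 'B' one, 'C' two
toy = pd.DataFrame({
    "App": ["A", "A", "A", "B", "C", "C"],
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS", "TOOLS"],
})

# Counts rows, so heavily reviewed apps inflate their category
review_rows = toy["Category"].value_counts()

# Counts distinct app names per category
unique_apps = (
    toy.groupby("Category")["App"].nunique().sort_values(ascending=False)
)

print(review_rows)
print(unique_apps)
```

Here both categories have three rows, but TOOLS contains two distinct apps and GAME only one.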

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# Set up the figure size for better visualization
plt.figure(figsize=(12,6))

# Count apps by Content Rating
content_counts = df["Content Rating"].value_counts()
content_counts.plot(kind="bar", color="blue", edgecolor="black")

# Chart title and labels
plt.title("Apps Distribution by Content Rating", fontsize=14)
plt.xlabel("Content Rating")
plt.ylabel("Number of Apps")
plt.xticks(rotation=45)
plt.grid(axis="y", linestyle="--", alpha=0.7)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I picked a bar chart because it is the most effective way to compare the number of apps across different content rating categories. It makes it easy to see which audience group dominates and highlights differences at a glance.

##### 2. What is/are the insight(s) found from the chart?

* "Everyone" dominates → Most apps are designed for all age groups, showing developers target the widest audience.

* "Teen" is second-largest → A strong market exists for apps aimed at adolescents, especially in gaming, social, and entertainment.

* Age-restricted apps are fewer → Categories like Mature 17+ and Adults Only 18+ form a small niche, indicating limited demand or stricter guidelines.

* Implication → Developers seeking mass adoption should focus on Everyone and Teen categories, while niche developers may explore age-restricted apps with specific audiences.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact.

* The dominance of “Everyone” and “Teen” categories suggests that developers and businesses should focus on these ratings to maximize reach and adoption, as they cover the widest user base.

* These insights also help app marketers refine their target audience strategies, ensuring better alignment of ads and promotions with user demographics.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

# Set up the figure size for better visualization
plt.figure(figsize=(12,6))
# Plot histogram of App Size
plt.hist(df['Size'], bins=50, color="red", edgecolor="black")

# Chart title and labels
plt.xlabel("App Size (MB)")
plt.ylabel("Number of Apps")
plt.title("Distribution of App Sizes (MB)")

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A histogram was used in this code to show how app sizes (in MB) are distributed across the dataset. It helps visualize which size ranges are most common and identify any outliers.

##### 2. What is/are the insight(s) found from the chart?

* Most apps are small: A large number of apps are below 40 MB, indicating smaller apps dominate the market.

* Fewer large apps: The number of apps decreases as size increases, with very few apps above 80 MB.

* Popular size ranges: Certain smaller size ranges (e.g., below 20 MB) have peaks, showing common app size preferences.

* Large apps are rare: Very large apps (near 100 MB) are uncommon compared to smaller ones.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact.

* Knowing that most apps are small (below 40 MB) can guide developers and businesses to optimize their app sizes, ensuring faster downloads and wider reach, especially in regions with limited internet speed or storage.

* Identifying popular size ranges allows businesses to align their app development with market trends, increasing the chances of adoption.


Potential Negative Growth:

* The rarity of large apps (>80 MB) may indicate user reluctance to download heavy apps, which could negatively impact businesses planning to launch feature-rich or resource-heavy applications.

* If a business ignores this trend and releases large apps without optimization, it may face lower downloads and user engagement, leading to negative growth.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# Set up the figure size for better visualization
plt.figure(figsize=(12,6))

# Count number of apps updated each year
updates_per_year = df['Last_Updated_Year'].value_counts().sort_index()

# Plot line chart
plt.plot(updates_per_year.index, updates_per_year.values, marker='o', color='blue')
# Chart title and labels
plt.title("Trend of App Updates by Year")
plt.xlabel("Year")
plt.ylabel("Number of Apps Updated")

# Add a grid for better readability of values
plt.grid(True)

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A line chart was chosen because it is ideal for showing trends over time. It clearly displays how the number of apps updated changes year by year, making it easy to identify increases, decreases, or patterns in app updates.

##### 2. What is/are the insight(s) found from the chart?

The chart shows an overall upward trend in app updates from 2011 to 2018, with a sharp surge beginning in 2016, indicating a growing and active app ecosystem.

* Sharp surge (2016–2017): There was a dramatic increase in updates, indicating rapid app development or new update policies.

* High levels maintained (2017–2018): After the surge, updates remained high, though growth slowed.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact.

* The upward trend and sharp surge in updates indicate that developers are actively maintaining apps, which helps businesses plan timely updates to stay competitive.

* Knowing that updates remain high allows companies to invest in app improvement and new features, improving user engagement and retention.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

# Set up the figure size for better visualization
plt.figure(figsize=(12,6))

# Boxplot for Price vs App Type
plt.subplot(1, 2, 1)
sns.boxplot(x='Type', y='Price', data=df, color='blue')

# Chart title and labels
plt.title("Price Distribution by App Type")
plt.ylabel("Price ($)")
plt.xlabel("App Type")

# Boxplot for Rating vs App Type
plt.subplot(1, 2, 2)
sns.boxplot(x='Type', y='Rating', data=df, color='red')

# Chart title and labels
plt.title("Rating Distribution by App Type")
plt.ylabel("Rating")
plt.xlabel("App Type")
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A box plot was chosen because it effectively shows the distribution of a numeric variable (Price or Rating) across categories (Free vs Paid apps).

* It displays the median, quartiles, and outliers, helping compare central tendency and variability between Free and Paid apps.

* It makes it easy to spot differences in price and rating trends between the two app types.

##### 2. What is/are the insight(s) found from the chart?

The insights from the box plots are:

Price Distribution:

* Free apps have a median price of $0, while paid apps have a higher median price.

* Paid apps show a wider price range, indicating a few apps are significantly more expensive.

Rating Distribution:

* Both free and paid apps generally have high median ratings, though free apps have slightly higher median ratings.

* Paid apps have more consistent ratings (smaller interquartile range), while free apps show more low-rated outliers.

* Overall, paid apps tend to have more stable quality, while free apps exhibit a broader range of ratings.
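
The quantities a box plot displays (median and interquartile range) can also be computed directly, which makes comparisons like the ones above easier to verify. This is a sketch on invented ratings, and `iqr` is a hypothetical helper.

```python
import pandas as pd

def iqr(values: pd.Series) -> float:
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)

# Invented example: five free and five paid apps
toy = pd.DataFrame({
    "Type": ["Free"] * 5 + ["Paid"] * 5,
    "Rating": [4.5, 4.0, 3.0, 4.6, 4.4, 4.2, 4.3, 4.1, 4.4, 4.2],
})

# One row per app type with median, spread, and lowest rating
stats = toy.groupby("Type")["Rating"].agg(median="median", iqr=iqr, minimum="min")
print(stats)
```

In the notebook, the same call on `df` (or on a per-app deduplicated frame) would give the figures behind the box plots.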

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help create a positive business impact.

* Businesses can price paid apps strategically, knowing that higher-priced apps are accepted by users but outliers indicate opportunities for premium offerings.

* Insights about rating consistency suggest that investing in paid apps can lead to more stable user satisfaction, which supports retention and positive reviews.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# Optionally remove extreme outliers in Reviews
df_plot = df[df['Reviews'] < 500000].copy()  # adjust threshold if needed

# Set up the figure size for better visualization
plt.figure(figsize=(14,7))

# Scatter plot
sns.scatterplot(
    x='Size',
    y='Reviews',
    data=df_plot,
    alpha=0.5,
    color='blue',
    s=150
)

# Chart title and labels
plt.title("Relationship between App Size and Number of Reviews")
plt.xlabel("App Size (MB)")
plt.ylabel("Number of Reviews")
plt.tight_layout()

# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot was chosen because it clearly shows the relationship between App Size and Number of Reviews. It helps identify trends, clusters, and outliers, revealing whether larger apps tend to get more or fewer reviews.

##### 2. What is/are the insight(s) found from the chart?

The insights from the scatter plot are:

* No strong correlation: App size does not consistently determine the number of reviews.

* Smaller apps dominate: Most apps are small (under 20–30 MB) and have fewer reviews (under 100,000).

* High-review outliers: Some smaller and medium-sized apps have very high reviews, indicating factors other than size drive popularity.
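
A correlation coefficient can complement the visual impression from the scatter plot. The sketch below computes Pearson (linear) and Spearman (rank-based) coefficients. `size_review_correlation` is a hypothetical helper, and the toy numbers are invented purely to show how the two measures differ. They say nothing about the real dataset.

```python
import pandas as pd

def size_review_correlation(frame: pd.DataFrame, x: str = "Size", y: str = "Reviews") -> dict:
    """Return Pearson and Spearman correlation between two numeric columns."""
    pair = frame[[x, y]].dropna()
    pearson = pair[x].corr(pair[y])
    # Spearman is the Pearson correlation of the ranks
    spearman = pair[x].rank().corr(pair[y].rank())
    return {"pearson": pearson, "spearman": spearman}

# Invented data: reviews grow monotonically but not linearly with size
toy = pd.DataFrame({
    "Size": [5, 10, 20, 40, 80],
    "Reviews": [10, 100, 1000, 10000, 100000],
})
result = size_review_correlation(toy)
print(result)
```

Values near zero for both measures on the real `df` would support the "no strong correlation" reading. Review counts are heavily right-skewed, so the rank-based measure is usually the more informative of the two.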

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact: Developers can focus on smaller, optimized apps since app size doesn’t limit review potential, and studying high-review outliers can guide feature and marketing improvements.

Negative Growth: Investing heavily in very large apps may lead to poor engagement, as larger apps rarely achieve high review counts, so overemphasis on size over quality could hurt growth.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# Set up the figure size for better visualization
plt.figure(figsize=(6,5))

# Count plot (hue='Type' lets the palette apply per bar; the conflicting 'color' argument is dropped)
sns.countplot(x='Type', data=df, hue='Type', palette=["orange", "blue"], legend=False)

# Chart title and labels
plt.title("Number of Apps by Type (Free vs Paid)")
plt.xlabel("App Type")
plt.ylabel("Number of Apps")

plt.tight_layout()
# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A count plot was chosen because it clearly shows the number of apps in each category (Free vs Paid).

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart are:

* Majority are Free apps: Most apps in the Play Store are free.

* Few Paid apps: Paid apps make up only a small portion of the total apps.

* This suggests that developers primarily focus on free apps, likely relying on ads or in-app purchases for revenue.
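The Free vs Paid split can also be expressed as percentages. The sketch below uses made-up rows and assumes an `App` name column (not shown in this section) to illustrate one caveat: if the table holds one row per review rather than one row per app, row-level shares and app-level shares can differ.

```python
import pandas as pd

# Illustrative rows only; "App" is an assumed column name.
toy = pd.DataFrame({
    "App":  ["A", "A", "A", "B", "C", "C", "D"],
    "Type": ["Free", "Free", "Free", "Paid", "Free", "Free", "Paid"],
})

# Share of rows per type (what a count plot on the raw table reflects).
row_share = toy["Type"].value_counts(normalize=True).mul(100).round(1)

# Share of distinct apps per type, counting each app once.
app_share = (
    toy.drop_duplicates(subset="App")["Type"]
       .value_counts(normalize=True)
       .mul(100)
       .round(1)
)

print(row_share)
print(app_share)
```

If the two results are close on the real data, the count plot can safely be read as a statement about apps.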

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Business Impact:

* Businesses can focus on free apps with monetization strategies like ads or in-app purchases, as they dominate the market.

* Shows opportunity for innovative features or premium upgrades within free apps to generate revenue.

Potential Negative Growth:

* The small share of paid apps suggests that few users are willing to pay upfront, so paid apps compete for a limited audience.

* Launching a paid app without a strong value proposition could lead to low downloads and limited growth.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

# Get top 10 categories by number of apps
top10_categories = df['Category'].value_counts().head(10)

# Set up the figure size for better visualization
plt.figure(figsize=(10,6))

# pie chart
plt.pie(top10_categories, labels=top10_categories.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.tab20.colors)
plt.title("Top 10 App Categories by Number of Apps")
# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart was chosen because it clearly shows the proportion of apps in each category.

##### 2. What is/are the insight(s) found from the chart?

The insights from the pie chart are:

* Games dominate the Play Store: The GAME category accounts for 35.7% of the apps within the top 10 categories, making it the largest segment by far.

* Family and Health apps are significant: FAMILY (13.6%) and HEALTH_AND_FITNESS (7.3%) also hold a notable share.

* Other categories share the remainder: Categories like Travel, Tools, Productivity, Sports, Finance, Photography, and Dating each contribute between 5% and 7%, indicating smaller but meaningful niches.

* Concentrated distribution: A few categories (especially games) account for most of the apps within the top 10, while the remaining categories hold much smaller shares, highlighting both competitive and niche opportunities.
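The pie chart percentages are shares of the top 10 categories only, not of the whole dataset. A minimal sketch of the difference, using made-up category labels and `top_n = 3` to keep the example short:

```python
import pandas as pd

# Illustrative category column; the counts are invented for the example.
toy = pd.Series(
    ["GAME"] * 10 + ["FAMILY"] * 4 + ["TOOLS"] * 3 + ["FINANCE", "DATING", "SPORTS"],
    name="Category",
)

top_n = 3  # the notebook uses 10
category_counts = toy.value_counts()
top = category_counts.head(top_n)

# Share relative to the top-N subset (what the pie chart displays).
share_within_top = (top / top.sum() * 100).round(1)

# Share relative to all rows (smaller whenever other categories exist).
share_overall = (top / len(toy) * 100).round(1)

print(pd.DataFrame({"within_top": share_within_top, "overall": share_overall}))
```

Reporting both columns makes it clear which denominator a figure such as 35.7% refers to.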

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can create a positive business impact by focusing on high-demand categories like Games, Family, and Health & Fitness, which have large user bases and engagement. Niche categories offer opportunities with less competition, allowing innovation. However, the high concentration in a few categories poses a risk of intense competition, while smaller categories may limit growth due to their smaller audience.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

# Count reviews by Sentiment
sentiment_counts = df['Sentiment'].value_counts()

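# Set up the figure size for better visualization
plt.figure(figsize=(7,5))
# Pie chart; colors are keyed by label so each sentiment keeps an intuitive color
# regardless of the order returned by value_counts()
sentiment_colors = {'Positive': 'green', 'Negative': 'red', 'Neutral': 'orange'}
plt.pie(sentiment_counts, labels=sentiment_counts.index, autopct='%1.1f%%', startangle=140,
        colors=[sentiment_colors.get(s, 'gray') for s in sentiment_counts.index])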
plt.title("Percentage Distribution of App Reviews by Sentiment")
# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I picked a pie chart because it visually represents the proportion of app reviews across different sentiment categories (Positive, Negative, Neutral). Pie charts make it easy to see which sentiment dominates at a glance, showing relative shares clearly, which is ideal for categorical data like sentiment. It’s especially useful when the goal is to compare parts of a whole rather than exact counts.

##### 2. What is/are the insight(s) found from the chart?

The insights from the pie chart are:

* High Positive Sentiment: 63.6% of reviews are positive, indicating that most users are satisfied with the apps.

* Notable Negative Feedback: 23.9% of reviews are negative, highlighting areas where apps may need improvements.

* Neutral Reviews are Small: 12.5% of reviews are neutral, showing a minor portion of users are indifferent.

* Business Implication: Overall, the market perception is favorable, but addressing negative feedback can further improve app ratings, downloads, and user retention.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights can help businesses improve and grow. The high percentage of positive reviews (63.6%) indicates strong user satisfaction, which can be leveraged for marketing, user acquisition, and retention strategies. Understanding what users like can guide feature improvements and enhance app engagement. On the negative side, the 23.9% share of negative reviews is a risk: if recurring complaints go unaddressed, they can lower ratings and hurt retention.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

# Group by Year and Sentiment, count reviews
review_counts = df.groupby(['Last_Updated_Year', 'Sentiment'])['Translated_Review'].count().reset_index()
review_counts.rename(columns={'Translated_Review':'Review_Count'}, inplace=True)

# Set up the figure size for better visualization
plt.figure(figsize=(12,6))

# Bar plot
sns.barplot(
    data=review_counts,
    x='Last_Updated_Year',
    y='Review_Count',
    hue='Sentiment',
    palette='Set2'
)
# Chart title and labels
plt.xlabel('Year')
plt.ylabel('Number of Reviews')
plt.title('Count of Reviews by Sentiment for Each Year')
plt.legend(title='Sentiment')
# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I picked a bar plot because it clearly shows the count of reviews across different categories (Sentiments) for each year. Bar plots are ideal for comparing discrete groups and highlighting differences in magnitude, making it easy to see which sentiment dominates in a particular year and how trends change over time.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that positive reviews dominate in every year from 2015 to 2018. Review counts are low for 2011–2015, grow gradually through 2017, and surge in 2018. Because the x-axis is the app's last-updated year (`Last_Updated_Year`) rather than the date each review was written, the 2018 surge likely reflects that most reviewed apps were updated recently, not necessarily a rise in satisfaction. Overall, the proportion of sentiments remains broadly consistent despite the changes in review volume.
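Whether the sentiment mix really stays consistent is easier to judge from per-year percentages than from raw counts. The sketch below uses invented rows with the same column names as the chart code and normalizes each year to 100%.

```python
import pandas as pd

# Illustrative rows only; column names follow the chart code.
toy = pd.DataFrame({
    "Last_Updated_Year": [2017] * 4 + [2018] * 8,
    "Sentiment": (
        ["Positive", "Positive", "Negative", "Neutral"]
        + ["Positive"] * 5 + ["Negative"] * 2 + ["Neutral"]
    ),
})

# normalize="index" divides each row (year) by its own total,
# so years with very different review volumes become comparable.
share = (
    pd.crosstab(toy["Last_Updated_Year"], toy["Sentiment"], normalize="index")
      .mul(100)
      .round(1)
)

print(share)
```

Plotting `share` instead of the raw counts would show directly whether the positive share shifts from year to year.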

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can drive a positive business impact. The dominance of positive reviews shows strong user satisfaction, which can boost app reputation and marketing efforts. The sharp rise in positive reviews in 2018 may reflect successful updates or features, which can guide future improvements. Gradual growth in 2017 indicates increasing user engagement, while consistent sentiment proportions help prioritize areas for maintaining quality and addressing negative feedback. Overall, this can enhance ratings, downloads, and revenue.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# Group by Year and calculate average Sentiment_Polarity
polarity_trend = df.groupby('Last_Updated_Year')['Sentiment_Polarity'].mean().reset_index()

# Set up the figure size for better visualization
plt.figure(figsize=(12,6))

# Line plot
plt.plot(
    polarity_trend['Last_Updated_Year'],
    polarity_trend['Sentiment_Polarity'],
    marker='o',
    linestyle='-',
    color='green'
)
# Chart title and labels
plt.xlabel('Year')
plt.ylabel('Average Sentiment Polarity')
plt.title('Change in Average Sentiment Polarity of Reviews Over the Years')
plt.grid(True)
# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I picked a line chart because it is ideal for showing trends over time. Since we want to track how the average Sentiment_Polarity changes across years, a line chart clearly highlights increases, decreases, or stable patterns, making it easy to interpret temporal changes in user sentiment.

##### 2. What is/are the insight(s) found from the chart?

The insights from the chart are:

* Initial Decline (2011–2012): Average sentiment polarity dropped sharply, indicating a rise in negative reviews or reduced positive sentiment.

* Recovery and Fluctuation (2012–2015): Sentiment gradually improved with minor fluctuations, showing a recovery in user satisfaction.

* Peak Sentiment (~2015): Sentiment polarity reached its highest point, reflecting the most positive reviews in the period.

* Subsequent Decline (2015–2017): Average sentiment fell again, indicating a decrease in positive user sentiment.

* Slight Rebound (2017–2018): Sentiment shows modest improvement or stabilization, though not reaching the 2015 peak.

* Overall Trend: Sentiment polarity varied significantly over the years, highlighting dynamic changes in user experience and opinions.
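A yearly mean is only as reliable as the number of reviews behind it, and the early years contain few reviews. The sketch below, on invented values, reports the review count next to each mean and flags sparse years. The `min_reviews` threshold is a hypothetical choice, not something taken from the notebook.

```python
import pandas as pd

# Illustrative values only; column names follow the chart code.
toy = pd.DataFrame({
    "Last_Updated_Year": [2011, 2012, 2012, 2018, 2018, 2018, 2018, 2018],
    "Sentiment_Polarity": [0.8, -0.2, 0.1, 0.3, 0.2, 0.1, 0.25, 0.15],
})

# Named aggregation: mean polarity and number of reviews per year.
trend = (
    toy.groupby("Last_Updated_Year")["Sentiment_Polarity"]
       .agg(mean_polarity="mean", n_reviews="count")
       .reset_index()
)

min_reviews = 3  # hypothetical cut-off for a trustworthy yearly mean
trend["reliable"] = trend["n_reviews"] >= min_reviews

print(trend)
```

Sharp swings in years flagged as unreliable (such as a drop between two sparsely populated years) may be sampling noise rather than a real change in sentiment.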

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights can help create a positive business impact by showing when user sentiment was high, guiding strategies to replicate successful features or updates. The slight rebound in 2017–2018 indicates opportunities to further improve user experience. However, the declines in 2011–2012 and 2015–2017 highlight periods of lower satisfaction, which could lead to reduced engagement or revenue. Identifying and addressing issues from these periods can prevent negative growth. Overall, tracking sentiment trends helps optimize app development and user retention.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

# Count number of reviews by Sentiment per year
sentiment_year = df.groupby(['Last_Updated_Year', 'Sentiment']).size().unstack(fill_value=0)

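# Plot stacked area chart. DataFrame.plot creates its own figure, so the size is
# passed via figsize; a separate plt.figure() call would only leave an empty figure.
# Columns are in alphabetical order (Negative, Neutral, Positive), matched by the colors.
sentiment_year.plot(kind='area', stacked=True, figsize=(10,6),
                    color=['red', 'blue', 'green'], alpha=0.8)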

# Chart title and labels
plt.title('Sentiment Distribution of App Reviews Over Years', fontsize=14)
plt.xlabel('Year', fontsize=12)
plt.ylabel('Number of Reviews', fontsize=12)
plt.legend(title='Sentiment')
# Show the plot
plt.show()

##### 1. Why did you pick the specific chart?

I picked a stacked area chart because it clearly shows how Positive, Neutral, and Negative sentiments change over time. It allows comparison of all three categories in one view and highlights trends in user reviews across years. Slightly transparent fills keep the chart visually clean and easy to interpret.

##### 2. What is/are the insight(s) found from the chart?

Here are the key insights from the chart:

* Growth in Reviews: The total number of reviews increased steadily over the years, with a sharp rise in 2017–2018.

* Positive Sentiment Dominates: Positive reviews consistently make up the largest share and show the most growth.

* Rising Negative & Neutral Reviews: Both negative and neutral reviews also increased in recent years, especially after 2016.

* Shift in Proportions: While positives remain dominant, the share of neutral and negative reviews became more visible in later years, indicating a broader spread of user opinions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes ✅, the gained insights can help create a positive business impact:

* Product Improvement: The rise in negative and neutral reviews alerts businesses to identify common issues and improve app performance, features, or user experience.

* Customer Engagement: The dominance of positive reviews shows strong user satisfaction, which businesses can leverage in marketing campaigns to build trust.

* Trend Monitoring: Tracking sentiment over time helps businesses see whether updates are improving user perception or causing dissatisfaction, guiding future development strategies.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

# Filtering numerical data
num_data = df.select_dtypes(include=["int64","float64"])
correl = num_data.corr()

# Set up figure size for better visualization
plt.figure(figsize = (8,6))

# Plot heatmap
sns.heatmap(correl,annot = True,cmap = "rainbow")
plt.tight_layout()

# Display the heatmap
plt.show()

##### 1. Why did you pick the specific chart?

The heatmap was chosen as it clearly visualizes correlations between numerical variables, making patterns and relationships easy to interpret at a glance.

##### 2. What is/are the insight(s) found from the chart?

Here are the insights from the heatmap correlation matrix:

* Interconnected Growth Factors: Reviews, Installs, and Size are strongly correlated, showing that bigger apps tend to attract more installs and generate more reviews.

* User Feedback Link: Sentiment Polarity and Sentiment Subjectivity are positively correlated, meaning more subjective (opinionated) reviews tend to have higher, more positive polarity scores.

* Price Independence: Price shows almost no meaningful correlation with ratings, installs, reviews, or sentiments, suggesting it does not drive user behavior strongly.

* Weak Influence of Ratings: Ratings have only weak links with Reviews and Size, indicating that factors beyond just app popularity or size affect user ratings.

* Negative Sentiment Trend: Larger user bases (more installs/reviews) are associated with slightly lower sentiment polarity, possibly because a wider audience gives more diverse and critical feedback.
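Reading individual cells off a heatmap is error-prone, so it can help to list the variable pairs sorted by absolute correlation. The following is a minimal sketch on invented numbers. It mirrors the notebook's `corr()` call and keeps only the upper triangle so that each pair appears once.

```python
import numpy as np
import pandas as pd

# Illustrative numeric columns only.
toy = pd.DataFrame({
    "Reviews":  [10, 200, 3000, 40000, 500000],
    "Installs": [100, 5000, 10000, 1000000, 50000000],
    "Price":    [0.0, 0.99, 0.0, 0.0, 4.99],
})

corr = toy.corr()  # Pearson by default, as in the heatmap code

# Boolean mask for the upper triangle; k=1 excludes the diagonal of 1.0 values.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One entry per variable pair, strongest (by absolute value) first.
pairs = (
    corr.where(mask)
        .stack()
        .dropna()
        .sort_values(key=abs, ascending=False)
)

print(pairs)
```

A ranked list like this makes it straightforward to confirm which relationships are actually strong before describing them in prose.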

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
key_cols = ['Rating', 'Reviews', 'Installs', 'Price', 'Sentiment_Polarity']
df_key = df[key_cols].dropna()

sns.pairplot(df_key, diag_kind='kde', plot_kws={'alpha':0.5, 's':40, 'edgecolor':'k'})
plt.suptitle("Pairwise Relationships (Key Features)", y=1.02)
plt.show()


##### 1. Why did you pick the specific chart?

I chose the pair plot because it shows both the distribution of each variable and the relationships between pairs of variables, making it easier to detect patterns, correlations, or clusters in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that apps with higher installs also receive more reviews, confirming strong user engagement. Ratings are mostly clustered between 4.2–4.6, indicating overall positive user experience. Price has little effect on installs or reviews, as most apps are free or low-priced. Sentiment polarity is generally positive but not strongly tied to popularity.
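Count-like columns such as Reviews and Installs span several orders of magnitude, which squeezes most points into one corner of a pair plot. One common option, sketched here on invented values, is to apply `log1p` to those columns before plotting. Which columns to treat as skewed is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Illustrative values only.
toy = pd.DataFrame({
    "Rating":   [4.1, 4.5, 3.9, 4.7],
    "Reviews":  [0, 150, 20000, 3000000],
    "Installs": [10, 1000, 1000000, 100000000],
})

skewed_cols = ["Reviews", "Installs"]  # assumed heavy-tailed columns

# log1p(x) = log(1 + x) is defined at zero, so apps with no reviews are kept.
toy_log = toy.assign(
    **{f"log1p_{col}": np.log1p(toy[col]) for col in skewed_cols}
).drop(columns=skewed_cols)

print(toy_log)
# The transformed frame could then be passed to the plot, e.g. sns.pairplot(toy_log)
```

On the log scale, the relationship between installs and reviews is usually easier to see because the few very large apps no longer dominate the axes.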

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

1. Focus on High-Demand Categories
Since Games, Family, and Lifestyle apps dominate installs and engagement, developers should prioritize these categories or identify innovative niches where competition is lower but demand exists.

2. Leverage User Sentiment & Feedback
With 63.6% positive reviews but nearly 24% negative, businesses must address recurring issues flagged in negative reviews. Proactive updates and bug fixes can turn dissatisfied users into loyal ones, improving retention.

3. Optimize Pricing Strategy
Since most apps are free and price shows little impact on ratings or installs, focusing on freemium models (ads, in-app purchases, subscriptions) will likely generate better revenue than high upfront prices.

4. Keep Apps Lightweight & Regularly Updated
Most apps in the dataset are small, and apps updated recently (2016–2018) account for the bulk of reviews. Businesses should optimize app size and maintain an active update cycle to stay competitive.

5. Target the Widest Audience
With “Everyone” being the largest content rating group, apps designed for universal use (with teen-friendly features) are more likely to gain traction. Niche audiences (17+ apps) can still be explored but offer limited market share.

# **Conclusion**

The analysis highlights that success on the Play Store is driven by popular categories (Games, Family, Lifestyle), lightweight apps, frequent updates, and positive user engagement. Since most apps are free, adopting freemium models is more effective for revenue. Targeting the widest audience with “Everyone” content rating while addressing user feedback will help apps achieve higher installs, better ratings, and sustainable growth.
