<a href="https://colab.research.google.com/github/bashayantan/Play-Store-App-Review-Analysis-project/blob/main/Play_Store_App_Review_Analysis_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Play Store App Review Analysis



##### **Project Type**    - Exploratory Data Analysis (EDA)
##### **Contribution**    - Individual
##### **Member  -**  Shayantan Banerjee


# **Project Summary -**

The objective of this project was to perform an in-depth Exploratory Data Analysis (EDA) on Google Play Store applications and their corresponding user reviews in order to identify the key factors that drive app engagement, user satisfaction, and overall success in the Android ecosystem. The analysis combined app-level metadata with user sentiment data to provide a holistic, data-driven view of the Play Store market.

The project utilized two datasets: the Play Store Apps dataset, which contains information such as app category, rating, installs, size, type (free/paid), price, and reviews; and the User Reviews dataset, which includes translated review text along with sentiment labels, polarity, and subjectivity. Together, these datasets enabled both quantitative performance analysis and qualitative sentiment-based insights.

A significant portion of the work focused on data wrangling and preprocessing. Raw datasets contained missing values, duplicate app entries, and incorrect data types. These issues were systematically addressed by removing duplicates, handling missing values using business-justified strategies, converting string-based numerical columns (such as installs, price, and size) into usable numeric formats, and engineering new features like log-transformed installs, reviews-per-install ratio, and app-level sentiment metrics. This ensured that the data was clean, consistent, and analysis-ready.

The exploratory analysis followed a structured UBM (Univariate, Bivariate, Multivariate) approach and included more than 15 meaningful visualizations, each supported by business interpretation. Univariate analysis revealed that most apps are highly rated, with ratings clustered between 4.0 and 4.5, indicating generally positive user sentiment across the Play Store. Category-wise analysis showed that a few categories dominate in terms of app count and installs, highlighting areas of intense competition as well as opportunities in less saturated categories.

Bivariate analysis uncovered relationships between key variables. Higher-rated apps generally achieve higher install counts, which suggests that user satisfaction and adoption move together. App size analysis showed that smaller apps tend to attract more installs, plausibly because they suit devices with limited storage or bandwidth. Comparisons between free and paid apps indicated that paid apps often have slightly higher and more consistent ratings, suggesting better perceived quality, while free apps show greater variability, possibly due to ads or performance issues.

Multivariate analysis, including correlation heatmaps and pair plots, reinforced these findings by showing strong correlations between installs and reviews, as well as between ratings and sentiment polarity. Aggregated sentiment analysis demonstrated that apps with more positive user sentiment consistently receive higher ratings, making sentiment an important early indicator of app health and future performance.

From a business perspective, the insights generated are highly actionable. Developers and product teams can use these findings to prioritize quality improvements, sentiment monitoring, size optimization, and strategic category selection. Encouraging user engagement through reviews and actively addressing negative feedback can significantly enhance app visibility, trust, and long-term growth.

In conclusion, this project successfully transformed raw Play Store data into a reliable analytical asset and delivered clear insights into the drivers of app success. The analysis supports informed decision-making for developers, marketers, and business stakeholders, enabling them to build competitive, user-centric, and scalable applications in the Android marketplace.

# **GitHub Link -**

https://github.com/bashayantan/Play-Store-App-Review-Analysis-project/tree/main

# **Problem Statement**


The Google Play Store hosts millions of applications across diverse categories, making it highly competitive for developers to gain visibility, user engagement, and sustained growth. Despite the availability of large volumes of app performance data and user reviews, many developers and businesses struggle to identify the key factors that contribute to an app’s success, such as high ratings, increased installs, and positive user sentiment.

The objective of this project is to analyze Google Play Store app data and user reviews to uncover meaningful patterns and relationships that influence app engagement and performance. By systematically cleaning, exploring, and visualizing the data, the project aims to identify how factors such as app category, size, pricing model, ratings, installs, and user sentiment impact overall app success.

This analysis will help stakeholders make data-driven decisions to improve app quality, optimize market positioning, enhance user satisfaction, and ultimately increase adoption and retention in a highly competitive mobile app ecosystem.

#### **Define Your Business Objective?**

The primary business objective of this project is to identify and analyze the key factors that drive the success of Google Play Store applications, measured through user engagement, ratings, installs, and sentiment.

By leveraging app metadata and user review sentiment, the objective is to provide actionable insights that help developers and businesses:

*   Improve app quality and user experience
*   Increase user satisfaction and positive ratings
*   Optimize app size, pricing, and category positioning
*   Enhance user engagement and retention
*   Make informed, data-driven decisions for sustainable growth

Ultimately, the goal is to enable stakeholders to design, launch, and manage apps that achieve higher visibility, stronger adoption, and long-term success in the competitive Android marketplace.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart follows the format below.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

# ---------------------------------------------
# PLAY STORE APP REVIEW ANALYSIS - LIBRARIES
# ---------------------------------------------

# Data manipulation & numerical computing
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Statistical analysis
from scipy import stats

# Text vectorization for review analysis (used only for review-text insights)
from sklearn.feature_extraction.text import CountVectorizer

# Plot styling
plt.style.use('seaborn-v0_8')
sns.set_palette("Set2")

# System & warnings
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully.")

### Dataset Loading

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print(f'User uploaded file "{fn}" with length {len(uploaded[fn])} bytes')


In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print(f'User uploaded file "{fn}" with length {len(uploaded[fn])} bytes')


In [None]:
# Load Dataset
# STEP 2: DATASET LOADING
# -------------------------------------------------------

# File paths (adjust if needed)
apps_filepath = "Play Store Data.csv"
reviews_filepath = "User Reviews.csv"

# Function to safely load CSV files with error handling
def load_csv_file(filepath):
    """
    Loads a CSV file safely with basic exception handling.
    Returns a pandas DataFrame if successful.
    """
    try:
        df = pd.read_csv(filepath)
        print(f"SUCCESS: '{filepath}' loaded successfully.")
        print(f"Shape: {df.shape}")
        return df

    except FileNotFoundError:
        print(f"ERROR: File '{filepath}' not found. Please check the path.")

    except pd.errors.EmptyDataError:
        print(f"ERROR: File '{filepath}' is empty.")

    except pd.errors.ParserError:
        print(f"ERROR: Parsing error while reading '{filepath}'.")

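    except Exception as e:
        print(f"Unexpected error while reading '{filepath}': {e}")

    # Reached only when loading failed; signal the failure explicitly
    return None

# Load datasets
apps_df = load_csv_file(apps_filepath)
reviews_df = load_csv_file(reviews_filepath)

# Stop here with a clear message rather than failing later on a None object
if apps_df is None or reviews_df is None:
    raise RuntimeError("Dataset loading failed; check the file paths above.")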

# Display first few rows
print("\nTop 5 rows from Apps Dataset:")
display(apps_df.head())

print("\nTop 5 rows from User Reviews Dataset:")
display(reviews_df.head())


### Dataset First View

In [None]:
# Dataset First Look

print("First 5 rows of Play Store Apps Dataset:")
display(apps_df.head())

print("\nLast 5 rows of Play Store Apps Dataset:")
display(apps_df.tail())

print("\nShape of Apps Dataset (Rows, Columns):")
print(apps_df.shape)

print("\nColumn Names:")
print(list(apps_df.columns))

print("First 5 rows of User Reviews Dataset:")
display(reviews_df.head())

print("\nLast 5 rows of User Reviews Dataset:")
display(reviews_df.tail())

print("\nShape of Reviews Dataset (Rows, Columns):")
print(reviews_df.shape)

print("\nColumn Names:")
print(list(reviews_df.columns))



### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

# -------------------------------------------------------
# DATASET SHAPE – PLAY STORE APPS DATA
# -------------------------------------------------------

apps_rows, apps_cols = apps_df.shape

print("Play Store Apps Dataset:")
print(f"Number of Rows    : {apps_rows}")
print(f"Number of Columns : {apps_cols}")



# -------------------------------------------------------
# DATASET SHAPE – USER REVIEWS DATA
# -------------------------------------------------------

reviews_rows, reviews_cols = reviews_df.shape

print("User Reviews Dataset:")
print(f"Number of Rows    : {reviews_rows}")
print(f"Number of Columns : {reviews_cols}")

### Dataset Information

In [None]:
# Dataset Info

# -------------------------------------------------------
# DATASET INFORMATION
# -------------------------------------------------------

print("Apps Dataset Info:")
apps_df.info()

print("\nUser Reviews Dataset Info:")
reviews_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

# -------------------------------------------------------
# DUPLICATE VALUES CHECK – APPS DATASET
# -------------------------------------------------------

apps_duplicate_count = apps_df.duplicated().sum()

print(f"Total duplicate rows in Apps dataset: {apps_duplicate_count}")

# -------------------------------------------------------
# DUPLICATE APP NAMES CHECK
# -------------------------------------------------------

duplicate_apps = apps_df['App'].duplicated().sum()

print(f"Duplicate App names in Apps dataset: {duplicate_apps}")

# Display sample duplicate apps
apps_df[apps_df['App'].duplicated(keep=False)].sort_values('App').head(10)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

# MISSING VALUES COUNT – APPS DATASET

missing_apps = apps_df.isnull().sum()

print("Missing Values Count – Apps Dataset:\n")
display(missing_apps[missing_apps > 0].sort_values(ascending=False))

# MISSING VALUES COUNT – REVIEWS DATASET

missing_reviews = reviews_df.isnull().sum()

print("Missing Values Count – Reviews Dataset:\n")
display(missing_reviews[missing_reviews > 0].sort_values(ascending=False))



In [None]:
# Visualizing the missing values

# VISUALIZING MISSING VALUES – APPS DATASET

missing_apps = apps_df.isnull().sum()
missing_apps = missing_apps[missing_apps > 0]

plt.figure(figsize=(10, 5))
missing_apps.sort_values(ascending=False).plot(
    kind='bar',
    color='steelblue'
)

plt.title("Missing Values Count per Column (Apps Dataset)")
plt.xlabel("Columns")
plt.ylabel("Number of Missing Values")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

# VISUALIZING MISSING VALUES – REVIEWS DATASET

missing_reviews = reviews_df.isnull().sum()
missing_reviews = missing_reviews[missing_reviews > 0]

plt.figure(figsize=(8, 4))
missing_reviews.sort_values(ascending=False).plot(
    kind='bar',
    color='darkorange'
)

plt.title("Missing Values Count per Column (Reviews Dataset)")
plt.xlabel("Columns")
plt.ylabel("Number of Missing Values")
plt.xticks(rotation=0)
plt.tight_layout()
plt.show()

### What did you know about your dataset?

**Understanding of the Dataset**

The project uses two interconnected datasets related to Google Play Store applications, each serving a distinct analytical purpose.


**1. Play Store Apps Dataset**

This dataset contains app-level metadata for Android applications available on the Google Play Store.

Key characteristics:

*   Each row represents one application
*   Contains 10,000+ apps across multiple categories
*   Includes both numerical and categorical features

Important columns and their meaning:

*   App → Name of the application
*   Category → App category (e.g., Games, Business, Education)
*   Rating → Average user rating (1–5 scale)
*   Reviews → Total number of user reviews
*   Installs → Approximate install count (given in ranges)
*   Size → App size (includes values like Varies with device)
*   Type → Free or Paid
*   Price → Cost of the app (for paid apps)
*   Content Rating → Age suitability
*   Genres → Sub-category classification

Data quality observations:

*   Rating contains missing values
*   Reviews, Installs, Price, and Size are stored as strings and need conversion
*   Duplicate app names exist, likely due to multiple versions or updates


**2. User Reviews Dataset**

This dataset contains individual user reviews and their associated sentiment scores.

Key characteristics:

*   Each row represents one user review
*   Much larger than the apps dataset (60,000+ records)
*   Enables sentiment-based insights

Important columns:

*   App → Name of the application being reviewed
*   Translated_Review → User review text in English
*   Sentiment → Review sentiment (Positive / Negative / Neutral)
*   Sentiment_Polarity → Numerical sentiment score (–1 to +1)
*   Sentiment_Subjectivity → Measure of opinion vs fact

Data quality observations:

*   Some reviews have missing text or sentiment values
*   Suitable for aggregation at the app level


**3. Relationship Between the Datasets**

*   Both datasets are linked using the App name
*   User reviews can be aggregated to compute:
    *   Average sentiment polarity per app
    *   Percentage of positive/negative reviews
*   This enables analysis of how user sentiment influences app ratings and installs
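
Because the link between the two datasets is a plain name match, it can help to check how many apps actually appear on both sides before merging. The sketch below uses two small made-up frames (`apps_demo` and `reviews_demo` are illustrative stand-ins, not the project data) to show one way of measuring that overlap.

```python
import pandas as pd

# Toy stand-ins for the two datasets (illustrative values only)
apps_demo = pd.DataFrame({"App": ["Alpha", "Beta", "Gamma"]})
reviews_demo = pd.DataFrame({"App": ["Alpha", "Alpha", "Delta"]})

# Unique app names on each side of the join
apps_names = set(apps_demo["App"].dropna())
review_names = set(reviews_demo["App"].dropna())

shared = apps_names & review_names        # apps that would receive sentiment metrics
reviews_only = review_names - apps_names  # reviewed apps with no metadata row

coverage = len(shared) / len(apps_names)
print(f"Apps with at least one review: {len(shared)} ({coverage:.1%})")
print(f"Reviewed apps missing from the apps table: {sorted(reviews_only)}")
```

A low coverage value would mean that sentiment-based conclusions describe only a subset of the apps, which is worth stating alongside any sentiment chart.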


**4. Business Relevance of the Dataset**

The combined datasets help answer key business questions:

*   What factors drive high app ratings and installs?
*   Which categories perform better in terms of user satisfaction?
*   How does user sentiment impact app success?
*   What improvements can developers make to increase engagement?

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

# UNDERSTANDING VARIABLES – APPS DATASET
# Create a summary table for column names and data types
apps_columns_info = pd.DataFrame({
    "Column Name": apps_df.columns,
    "Data Type": apps_df.dtypes.values
})

display(apps_columns_info)


# UNDERSTANDING VARIABLES – REVIEWS DATASET
reviews_columns_info = pd.DataFrame({
    "Column Name": reviews_df.columns,
    "Data Type": reviews_df.dtypes.values
})

display(reviews_columns_info)


In [None]:
# Dataset Describe

# DATASET DESCRIBE – PLAY STORE APPS

print("Statistical Summary of Play Store Apps Dataset:")
display(apps_df.describe().T)


# DATASET DESCRIBE – USER REVIEWS

print("Statistical Summary of User Reviews Dataset:")
display(reviews_df.describe().T)

### Variables Description

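In this project, the variables are the columns of the two datasets. The apps dataset describes each application through App, Category, Rating, Reviews, Size, Installs, Type, Price, Content Rating, and Genres. The reviews dataset describes each review through App, Translated_Review, Sentiment, Sentiment_Polarity, and Sentiment_Subjectivity. Reviews, Installs, Price, and Size are stored as text and are converted to numeric form during data wrangling.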

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.


# UNIQUE VALUES COUNT – APPS DATASET


apps_unique_values = pd.DataFrame({
    "Column Name": apps_df.columns,
    "Unique Values Count": [apps_df[col].nunique() for col in apps_df.columns]
})

apps_unique_values = apps_unique_values.sort_values(
    by="Unique Values Count", ascending=False
).reset_index(drop=True)

display(apps_unique_values)




# UNIQUE VALUES COUNT – REVIEWS DATASET


reviews_unique_values = pd.DataFrame({
    "Column Name": reviews_df.columns,
    "Unique Values Count": [reviews_df[col].nunique() for col in reviews_df.columns]
})

reviews_unique_values = reviews_unique_values.sort_values(
    by="Unique Values Count", ascending=False
).reset_index(drop=True)

display(reviews_unique_values)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# CREATE WORKING COPIES OF DATASETS
apps = apps_df.copy()
reviews = reviews_df.copy()

# Remove exact duplicate rows
apps.drop_duplicates(inplace=True)

# Convert Reviews to numeric for sorting
apps['Reviews'] = pd.to_numeric(apps['Reviews'], errors='coerce')

# Keep app record with highest number of reviews
apps = apps.sort_values('Reviews', ascending=False)\
           .drop_duplicates(subset='App', keep='first')

# Remove duplicate reviews
reviews.drop_duplicates(inplace=True)

# Drop apps with missing Rating (cannot be used for rating analysis)
# .copy() avoids SettingWithCopyWarning on the column assignments below
apps = apps[apps['Rating'].notnull()].copy()

# Fill missing Size with placeholder
# (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
apps['Size'] = apps['Size'].fillna('Varies with device')

# Fill missing Type with most frequent value
apps['Type'] = apps['Type'].fillna(apps['Type'].mode()[0])

# Fill missing Price with 0 (Free apps)
apps['Price'] = apps['Price'].fillna('0')

# Drop reviews with missing review text or sentiment polarity
reviews.dropna(subset=['Translated_Review', 'Sentiment_Polarity'], inplace=True)

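# Strip '+' and ',' from Installs, then convert to numeric (unparseable values become NaN)
apps['Installs'] = apps['Installs'].str.replace('+', '', regex=False)\
                                   .str.replace(',', '', regex=False)
apps['Installs'] = pd.to_numeric(apps['Installs'], errors='coerce')

# Strip '$' from Price, then convert to numeric (unparseable values become NaN)
apps['Price'] = apps['Price'].str.replace('$', '', regex=False)
apps['Price'] = pd.to_numeric(apps['Price'], errors='coerce')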

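def convert_size(size):
    """Convert a size string to megabytes; return NaN when no numeric size is given."""
    # Guard against non-string values such as NaN
    if not isinstance(size, str):
        return np.nan
    if 'M' in size:
        return float(size.replace('M', ''))
    elif 'k' in size:
        return float(size.replace('k', '')) / 1024
    else:
        # e.g. 'Varies with device'
        return np.nan

apps['Size_MB'] = apps['Size'].apply(convert_size)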

# Log transform installs to reduce skewness
apps['Log_Installs'] = np.log1p(apps['Installs'])

# Flag for paid apps (1 = paid, 0 = free or unknown price)
apps['Is_Paid'] = (apps['Price'] > 0).astype(int)

# Reviews per install ratio (zero installs become NaN to avoid division by zero)
apps['Reviews_per_Install'] = apps['Reviews'] / apps['Installs'].replace(0, np.nan)

# Aggregate sentiment metrics per app
review_agg = reviews.groupby('App').agg(
    Avg_Sentiment_Polarity=('Sentiment_Polarity', 'mean'),
    Positive_Review_Ratio=('Sentiment', lambda x: (x == 'Positive').mean()),
    Review_Count=('Sentiment', 'count')
).reset_index()

# Merge sentiment data with apps dataset
apps = apps.merge(review_agg, on='App', how='left')

print("Final Apps Dataset Shape:", apps.shape)
print("Final Reviews Dataset Shape:", reviews.shape)

print("\nMissing Values in Apps Dataset:")
display(apps.isnull().sum()[apps.isnull().sum() > 0])

print("\nMissing Values in Reviews Dataset:")
display(reviews.isnull().sum()[reviews.isnull().sum() > 0])



### What all manipulations have you done and insights you found?

#### **Data Manipulations Performed & Key Insights Derived**

##### **Dataset Loading and Validation**

Manipulations:

*   Loaded two datasets: Play Store Apps and User Reviews
*   Verified row and column counts
*   Checked schema consistency and data types
*   Validated essential columns required for analysis

Insights:

*   Apps dataset provides app-level metadata
*   Reviews dataset provides granular user sentiment data
*   Both datasets are linked through the App column, enabling integrated analysis





##### **Duplicate Handling**

Manipulations:

*   Removed exact duplicate rows from both datasets
*   Identified duplicate App names in the apps dataset
*   Retained only one record per app based on highest review count

Insights:

*   Duplicate apps were likely caused by multiple versions or repeated scraping
*   Removing duplicates prevented inflated installs and ratings
*   Ensured each app is represented uniquely in the analysis





##### **Missing Values Identification & Treatment**

Manipulations:

*   Identified missing values across all columns
*   Removed rows with missing ratings (critical KPI)
*   Filled missing Price values with zero (free apps)
*   Filled missing categorical values using mode
*   Dropped reviews with missing text or sentiment polarity

Insights:

*   Missing ratings likely belonged to newly launched apps
*   Missing Price values were treated as indicating free apps
*   Removing invalid reviews improved sentiment reliability




##### **Data Type Corrections**

Manipulations:

*   Converted Reviews, Installs, and Price from string to numeric
*   Removed special characters (`+`, `,`, `$`) before conversion
*   Converted Size into numerical megabytes (MB)

Insights:

*   Numeric conversion enabled accurate aggregation and visualization
*   Raw string formats were misleading and unsuitable for analysis
*   App size became a usable feature for performance comparison
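
To make the size conversion concrete, here is a small worked example on made-up values. The helper name `size_to_mb` and the sample strings are illustrative assumptions; the kilobyte-to-megabyte factor of 1024 follows the wrangling code above.

```python
import numpy as np
import pandas as pd

def size_to_mb(size):
    """Convert a size string such as '19M' or '512k' to megabytes (hypothetical helper)."""
    # Non-string inputs (None, NaN) have no usable size
    if not isinstance(size, str):
        return np.nan
    size = size.strip()
    if size.endswith("M"):
        return float(size[:-1])
    if size.endswith("k"):
        return float(size[:-1]) / 1024
    # Anything else, e.g. 'Varies with device', carries no numeric size
    return np.nan

# Illustrative inputs only
sample = pd.Series(["19M", "512k", "Varies with device", None])
sample_mb = sample.apply(size_to_mb)
print(sample_mb.tolist())
```

Sizes that cannot be parsed stay as NaN rather than being replaced with zero, so they drop out of size-based charts instead of distorting them.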





##### **Feature Engineering**

Manipulations:

*   Created Log_Installs to handle skewed install distribution
*   Added Is_Paid flag to distinguish free vs paid apps
*   Calculated Reviews_per_Install as an engagement metric
*   Created size and rating buckets for segmentation

Insights:

*   Install counts were highly right-skewed
*   Free apps dominate the Play Store ecosystem
*   Highly engaged apps show higher review-to-install ratios
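
The bucketing step is not spelled out in the wrangling cell, so the following is only a sketch of how size and rating buckets could be built with `pd.cut`. The bin edges, the labels, and the column names `Rating_Bucket` and `Size_Bucket` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data standing in for the cleaned apps table (illustrative values only)
demo = pd.DataFrame({
    "Rating": [2.8, 3.9, 4.2, 4.7],
    "Size_MB": [3.5, 18.0, 45.0, 120.0],
})

# Assumed rating bins; intervals are closed on the right, e.g. (3.5, 4.0]
rating_bins = [0, 3.5, 4.0, 4.5, 5.0]
rating_labels = ["<=3.5", "3.5-4.0", "4.0-4.5", "4.5-5.0"]
demo["Rating_Bucket"] = pd.cut(demo["Rating"], bins=rating_bins, labels=rating_labels)

# Assumed size bins in megabytes
size_bins = [0, 10, 50, float("inf")]
size_labels = ["Small", "Medium", "Large"]
demo["Size_Bucket"] = pd.cut(demo["Size_MB"], bins=size_bins, labels=size_labels)

print(demo)
```

Rows with a missing rating or size receive a missing bucket as well, which keeps them out of grouped summaries without extra filtering.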





##### **Sentiment Aggregation**

Manipulations:

*   Aggregated review-level sentiment to app-level metrics:
    *   Average sentiment polarity
    *   Positive review ratio
    *   Review count
*   Merged aggregated sentiment data with app metadata

Insights:

*   Apps with higher sentiment polarity tend to have better ratings
*   Negative sentiment clusters highlight stability and UX issues
*   Sentiment adds explanatory power beyond star ratings





##### **Final Dataset Validation**

Manipulations:

*   Rechecked dataset shape and missing values
*   Ensured datasets are analysis-ready
*   Exported cleaned datasets for reuse

Insights:

*   Final datasets are consistent, clean, and reliable
*   Ready for EDA, visualization, and ML modeling
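
The export step does not appear in the wrangling cell, so this is a minimal sketch of how a cleaned table could be written out and read back as a sanity check. The file name `apps_cleaned.csv` and the temporary directory are hypothetical choices, and the tiny frame stands in for the real cleaned data.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the cleaned apps table (illustrative values only)
demo_apps = pd.DataFrame({"App": ["Alpha", "Beta"], "Rating": [4.2, 3.9]})

# Hypothetical output location; a temporary directory keeps the sketch side-effect free
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "apps_cleaned.csv")

# index=False avoids writing the row index as an extra unnamed column
demo_apps.to_csv(out_path, index=False)

# Read the file back to confirm that shape and columns survived the round trip
reloaded = pd.read_csv(out_path)
print(reloaded.shape, list(reloaded.columns))
```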

##### **Overall Business Insights Gained**

*   App success is influenced by category, sentiment, and engagement
*   Ratings alone do not fully explain user satisfaction
*   User sentiment provides early warning signals for app performance issues
*   Data cleaning significantly improves insight accuracy

## ***4. Data Visualization, Storytelling & Experimenting with charts: Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

# CHART 1: DISTRIBUTION OF APP RATINGS (UNIVARIATE ANALYSIS)
plt.figure(figsize=(8, 5))

sns.histplot(apps['Rating'], bins=20, kde=True)

plt.title("Distribution of App Ratings on Google Play Store")
plt.xlabel("App Rating")
plt.ylabel("Number of Apps")
plt.show()

##### 1. Why did you pick the specific chart?

I chose this histogram because it clearly shows the distribution and spread of app ratings, which is a key performance indicator for app success.
It helps identify common rating ranges, skewness, and outliers that directly impact user perception and app visibility.

##### 2. What is/are the insight(s) found from the chart?

Most apps have ratings between 4.0 and 4.5, indicating generally high user satisfaction.
Apps with ratings below 3.5 are relatively few and are likely to face lower user trust and engagement.
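
A histogram shows the shape of the distribution, but the share of apps in each range can also be stated as a number. The sketch below does this on a small made-up series; the values are illustrative and the thresholds simply mirror the ranges mentioned above.

```python
import pandas as pd

# Toy ratings (illustrative values only); None stands for an unrated app
ratings = pd.Series([3.2, 4.0, 4.1, 4.3, 4.5, 4.8, None])
valid = ratings.dropna()

# between() includes both endpoints by default
share_4_to_45 = valid.between(4.0, 4.5).mean()
share_below_35 = (valid < 3.5).mean()

print(f"Share rated 4.0 to 4.5: {share_4_to_45:.1%}")
print(f"Share rated below 3.5: {share_below_35:.1%}")
```

Applied to the `Rating` column of the cleaned apps table, the same two lines would turn "most" and "relatively few" into reportable percentages.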

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights help set a clear quality benchmark (ratings ≥ 4.0) that developers should target to improve visibility and user acquisition.
Apps with ratings below 3.5 indicate negative growth potential, as low ratings reduce user trust, installs, and Play Store ranking.

#### Chart - 2

In [None]:
# CHART 2: CATEGORY-WISE APP COUNT (UNIVARIATE ANALYSIS)

plt.figure(figsize=(12, 6))

apps['Category'].value_counts().head(15).plot(
    kind='bar',
    color='teal'
)

plt.title("Top 15 App Categories by Number of Apps")
plt.xlabel("App Category")
plt.ylabel("Number of Apps")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

##### 1. Why did you pick the specific chart?

I chose a bar chart because it clearly compares the number of apps across different categories.
It helps quickly identify which categories are most saturated and competitive in the Play Store.

##### 2. What is/are the insight(s) found from the chart?

A few categories such as Family, Game, and Tools dominate the Play Store in terms of app count.
This indicates high competition in these categories, while less populated categories may offer better entry opportunities.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help developers identify less crowded categories where new apps can gain visibility more easily, leading to positive business impact.
Highly saturated categories signal negative growth risk due to intense competition, making it harder for new apps to stand out and acquire users.

#### Chart - 3

In [None]:
# Chart - 3 visualization code

# CHART 3: RELATIONSHIP BETWEEN APP RATINGS AND INSTALLS
# (BIVARIATE ANALYSIS: Numerical vs Numerical)

plt.figure(figsize=(8, 6))

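sns.scatterplot(
    data=apps,
    x='Rating',
    y='Installs',
    alpha=0.6
)

# Label and render the chart
plt.title("Relationship Between App Rating and Installs")
plt.xlabel("App Rating")
plt.ylabel("Number of Installs")
plt.show()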

##### 1. Why did you pick the specific chart?

I chose a scatter plot because it effectively shows the relationship between two numerical variables, app ratings and install counts.
It helps identify trends, patterns, and whether higher ratings are associated with higher app adoption.

##### 2. What is/are the insight(s) found from the chart?

Apps with higher ratings generally tend to have higher install counts, indicating a positive relationship between user satisfaction and adoption.
However, some apps achieve high installs despite moderate ratings, suggesting the influence of other factors such as brand value or category demand.
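
A scatter plot of heavily skewed install counts can be hard to read by eye, so a rank correlation is one way to put a number on the trend. The sketch below uses `scipy.stats.spearmanr` on made-up values; Spearman is chosen here as an assumption because it depends only on rank order and is therefore insensitive to the skew in installs.

```python
import pandas as pd
from scipy import stats

# Toy data (illustrative values only)
demo = pd.DataFrame({
    "Rating":   [3.1, 3.8, 4.0, 4.3, 4.6],
    "Installs": [10_000, 1_000, 50_000, 1_000_000, 5_000_000],
})

# Drop incomplete pairs before computing the correlation
pair = demo[["Rating", "Installs"]].dropna()
rho, p_value = stats.spearmanr(pair["Rating"], pair["Installs"])

print(f"Spearman rho = {rho:.2f}, p-value = {p_value:.3f}")
```

A coefficient near zero on the real data would suggest that the visual impression of a positive trend comes from a handful of very popular apps rather than from the bulk of the catalogue.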

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insight highlights that improving app quality and user experience to achieve higher ratings can directly drive higher installs and business growth.
Apps with low ratings but high installs indicate a risk of future negative growth, as poor user satisfaction can eventually reduce retention and long-term adoption.

#### Chart - 4

In [None]:
# Chart - 4 visualization code

# CHART 4: CATEGORY-WISE AVERAGE APP RATING
# (BIVARIATE ANALYSIS: Numerical vs Categorical)

plt.figure(figsize=(12, 6))

category_rating = (
    apps.groupby('Category')['Rating']
    .mean()
    .sort_values(ascending=False)
    .head(15)
)

category_rating.plot(
    kind='bar',
    color='slateblue'
)

plt.title("Top 15 App Categories by Average Rating")
plt.xlabel("App Category")
plt.ylabel("Average Rating")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()








##### 1. Why did you pick the specific chart?

I chose a bar chart because it clearly compares the average ratings across different app categories.
It helps identify which categories deliver better user satisfaction and performance.

##### 2. What is/are the insight(s) found from the chart?

Certain categories such as Education, Health & Fitness, and Books show higher average ratings, indicating stronger user satisfaction.
Categories with comparatively lower average ratings may require quality improvements to enhance user experience and retention.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights help developers prioritize categories with higher user satisfaction, increasing the chances of better adoption and long-term success.
Categories with lower average ratings signal potential negative growth, as poor user experience can lead to lower retention, weaker reviews, and reduced installs.

#### Chart - 5

In [None]:
# Chart - 5 visualization code

# CHART 5: APP RATING DISTRIBUTION BY TYPE (FREE vs PAID)
# (BIVARIATE ANALYSIS: Numerical vs Categorical)

plt.figure(figsize=(8, 6))

sns.boxplot(
    data=apps,
    x='Type',
    y='Rating'
)

plt.title("App Rating Distribution: Free vs Paid Apps")
plt.xlabel("App Type")
plt.ylabel("Rating")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot because it effectively compares the distribution, spread, and median ratings between free and paid apps.
It helps identify differences in user satisfaction and the presence of outliers across app types.

##### 2. What is/are the insight(s) found from the chart?

Paid apps generally show slightly higher median ratings than free apps, indicating better perceived quality.
Free apps display wider rating variability, suggesting inconsistent user experiences across offerings.
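
The difference between the two box plots can be checked with a non-parametric test, since ratings are bounded and skewed. The sketch below applies `scipy.stats.mannwhitneyu` to made-up ratings; the choice of this test and the toy values are assumptions for illustration, not results from the project data.

```python
import pandas as pd
from scipy import stats

# Toy data (illustrative values only)
demo = pd.DataFrame({
    "Type":   ["Free", "Free", "Free", "Free", "Paid", "Paid", "Paid"],
    "Rating": [3.0, 4.1, 4.4, 2.5, 4.3, 4.5, 4.2],
})

free = demo.loc[demo["Type"] == "Free", "Rating"].dropna()
paid = demo.loc[demo["Type"] == "Paid", "Rating"].dropna()

# Two-sided test: do the two rating distributions differ in location?
u_stat, p_value = stats.mannwhitneyu(paid, free, alternative="two-sided")

print(f"Median rating, free: {free.median():.2f}, paid: {paid.median():.2f}")
print(f"Mann-Whitney U = {u_stat:.1f}, p-value = {p_value:.3f}")
```

With many more free apps than paid apps in the real data, even a small median gap may come out as statistically significant, so the size of the gap matters as much as the p-value.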

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insight supports a positive business impact by showing that investing in quality for paid apps can lead to higher user satisfaction and trust.
For free apps, wider rating variability indicates potential negative growth if issues like ads or performance problems are not addressed, as these can lower retention and reviews.

#### Chart - 6

In [None]:
# Chart - 6 visualization code

# CHART 6: CATEGORY-WISE MEDIAN INSTALLS
# (BIVARIATE ANALYSIS: Numerical vs Categorical)

plt.figure(figsize=(12, 6))

category_installs = (
    apps.groupby('Category')['Installs']
    .median()
    .sort_values(ascending=False)
    .head(15)
)

category_installs.plot(
    kind='bar',
    color='darkgreen'
)

plt.title("Top 15 App Categories by Median Number of Installs")
plt.xlabel("App Category")
plt.ylabel("Median Installs")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()








##### 1. Why did you pick the specific chart?

I chose a bar chart because it allows a clear comparison of median install counts across different app categories.
Using the median reduces the effect of extreme outliers and provides a more representative view of typical category performance.
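
To illustrate why the median is the safer summary here, the sketch below compares the mean and median of a small, made-up set of install counts in which one app is an extreme outlier. The values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical install counts for one category; the last app is an extreme outlier.
installs = pd.Series([1_000, 5_000, 10_000, 50_000, 1_000_000_000])

mean_installs = installs.mean()      # pulled far upward by the single outlier
median_installs = installs.median()  # stays at the middle value

print(f"Mean installs:   {mean_installs:,.0f}")
print(f"Median installs: {median_installs:,.0f}")
```

A single blockbuster app is enough to make the mean unrepresentative of a typical app in the category, while the median is unaffected.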

##### 2. What is/are the insight(s) found from the chart?

Categories such as Games, Communication, and Social have the highest median installs, indicating strong user demand.
Categories with lower median installs suggest niche usage or limited market reach.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights help developers target high-demand categories where user adoption potential is strong, leading to positive business impact.
Categories with consistently low median installs indicate negative growth risk, as limited demand can restrict user acquisition and revenue potential.

#### Chart - 7

In [None]:
# Chart - 7 visualization code

# CHART 7: AVERAGE SENTIMENT POLARITY VS APP RATING
# (BIVARIATE ANALYSIS: Numerical vs Numerical)

plt.figure(figsize=(8, 6))

sns.scatterplot(
    data=apps,
    x='Avg_Sentiment_Polarity',
    y='Rating',
    alpha=0.6
)

plt.title("Relationship Between User Sentiment Polarity and App Rating")
plt.xlabel("Average Sentiment Polarity")
plt.ylabel("App Rating")
plt.show()


##### 1. Why did you pick the specific chart?

I chose a scatter plot because it clearly shows the relationship between user sentiment polarity and app ratings, both being numerical variables.
It helps assess whether more positive user sentiment aligns with higher app ratings.

##### 2. What is/are the insight(s) found from the chart?

Apps with higher average sentiment polarity generally have higher ratings, indicating strong alignment between user sentiment and star ratings.
Negative or neutral sentiment is associated with lower ratings, highlighting user dissatisfaction.
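
The visual trend can be backed by a rank correlation. The following is a minimal sketch on made-up values, assuming that `Avg_Sentiment_Polarity` is missing for apps that have no matching reviews; such rows are dropped before the correlation is computed. The `toy` DataFrame is hypothetical.

```python
import pandas as pd

# Hypothetical values; the last app has no reviews, so its polarity is missing.
toy = pd.DataFrame({
    "Avg_Sentiment_Polarity": [-0.2, 0.0, 0.1, 0.3, 0.5, float("nan")],
    "Rating": [3.1, 3.8, 3.6, 4.2, 4.6, 4.0],
})

# Keep only apps where both values are present.
pair = toy[["Avg_Sentiment_Polarity", "Rating"]].dropna()

# Spearman correlation measures how consistently the two variables rise together.
rho = pair.corr(method="spearman").loc["Avg_Sentiment_Polarity", "Rating"]

print(f"Apps with both values: {len(pair)}")
print(f"Spearman correlation:  {rho:.2f}")
```

Reporting the number of apps that actually enter the calculation is useful, because the scatter plot silently omits points with missing values.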

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insight helps businesses focus on improving user sentiment through better stability, features, and support, which can positively influence ratings and installs.
Apps with consistently negative sentiment indicate a risk of negative growth, as dissatisfied users are more likely to give low ratings, leave negative reviews, and reduce future adoption.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

# CHART 8: CORRELATION HEATMAP OF NUMERICAL VARIABLES
# (MULTIVARIATE ANALYSIS)

num_cols = [
    'Rating',
    'Installs',
    'Reviews',
    'Price',
    'Size_MB',
    'Log_Installs',
    'Avg_Sentiment_Polarity'
]

# Compute correlation matrix
corr_matrix = apps[num_cols].corr(method='spearman')

plt.figure(figsize=(10, 6))
sns.heatmap(
    corr_matrix,
    annot=True,
    fmt=".2f",
    cmap='coolwarm',
    linewidths=0.5
)

plt.title("Correlation Heatmap of Key Numerical Variables")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a correlation heatmap because it allows simultaneous comparison of relationships among multiple numerical variables.
It helps quickly identify strong positive or negative correlations that influence app performance and business outcomes.

##### 2. What is/are the insight(s) found from the chart?

Ratings show a positive correlation with sentiment polarity, indicating that happier users tend to give higher ratings.
Installs and reviews are strongly correlated, suggesting that higher app adoption naturally drives more user feedback.
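
A heatmap is easier to interpret when the strongest pairs are also listed explicitly. The sketch below, on hypothetical data, ranks the variable pairs by absolute Spearman correlation. It also shows a property worth keeping in mind when reading the heatmap: Spearman correlation depends only on ranks, so `Installs` and its log transform are perfectly correlated by construction. The use of `np.log1p` for `Log_Installs` is an assumption, since the feature's exact definition is not shown in this section.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's, values are made up.
toy = pd.DataFrame({
    "Installs": [10, 1_000, 50_000, 1_000_000, 5_000_000],
    "Reviews": [1, 40, 900, 30_000, 20_000],
})
toy["Log_Installs"] = np.log1p(toy["Installs"])  # assumed definition

corr = toy.corr(method="spearman")

# Keep each pair once (upper triangle, diagonal excluded), then sort by strength.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(mask)
        .stack()
        .dropna()
        .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(pairs)
```

Because the `Installs` and `Log_Installs` rows carry identical information under Spearman correlation, one of them could be dropped from the heatmap without losing anything.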

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights help businesses prioritize user satisfaction and engagement, which are the factors most closely associated with ratings and installs in this data.
The negative-growth risk is the same relationship in reverse: because ratings track sentiment, an app whose user sentiment declines is likely to see its rating fall as well, which can reduce installs and long-term retention. These correlations show association rather than causation.

#### Chart - 9

In [None]:
# Chart - 9 visualization code

plt.figure(figsize=(8, 5))

sns.histplot(
    apps['Size_MB'].dropna(),
    bins=30,
    kde=True,
    color='coral'
)

plt.title("Distribution of App Sizes (in MB)")
plt.xlabel("App Size (MB)")
plt.ylabel("Number of Apps")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram because it clearly shows the distribution and spread of app sizes across the Play Store.
It helps identify common size ranges and detect unusually large apps that may affect user downloads.

##### 2. What is/are the insight(s) found from the chart?

Most apps are relatively small in size, with a high concentration below 50 MB.
Very large apps are fewer, suggesting potential download and storage concerns for users.
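
The "below 50 MB" observation can be quantified directly. This is a minimal sketch on made-up sizes; the values and the `sizes` Series are hypothetical, and missing sizes are dropped in the same way as in the chart code.

```python
import pandas as pd

# Hypothetical app sizes in MB; one app has no recorded size.
sizes = pd.Series([2.5, 8.0, 14.0, 25.0, 48.0, 60.0, 95.0, float("nan")], name="Size_MB")

known = sizes.dropna()

# The mean of a boolean Series is the fraction of True values.
share_under_50 = (known < 50).mean()
median_size = known.median()

print(f"Share of apps under 50 MB: {share_under_50:.1%}")
print(f"Median app size: {median_size} MB")
```

Running the same two lines on `apps['Size_MB']` would replace the visual estimate with an exact share.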

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insight helps developers optimize app size to improve download rates, especially for users with limited storage or slower networks.
Very large app sizes can lead to negative growth by increasing uninstall rates and discouraging downloads due to storage and data usage concerns.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

plt.figure(figsize=(8, 6))

sns.scatterplot(
    data=apps,
    x='Size_MB',
    y='Installs',
    alpha=0.6
)

plt.title("Relationship Between App Size and Number of Installs")
plt.xlabel("App Size (MB)")
plt.ylabel("Number of Installs")
plt.yscale('log')  # Log scale to handle skewness in installs
plt.show()


##### 1. Why did you pick the specific chart?

I chose a scatter plot because it effectively shows the relationship between two numerical variables: app size and install count.
It helps identify whether larger app sizes discourage user adoption or have no significant impact on installs.

##### 2. What is/are the insight(s) found from the chart?

Smaller apps generally tend to have higher install counts, indicating better user adoption.
Larger apps show more variability and often lower installs, suggesting size can be a barrier to downloads.
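
Scatter plots with many overlapping points can be hard to read, so one way to test this insight is to bucket apps by size and compare the median installs per bucket. The sketch below uses made-up data and arbitrary bucket edges, both of which are assumptions chosen for illustration. The toy values are not meant to reproduce the pattern described above.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's, values are made up.
toy = pd.DataFrame({
    "Size_MB": [3, 8, 12, 30, 45, 70, 90],
    "Installs": [100_000, 1_000_000, 50_000, 500_000, 10_000, 1_000_000, 5_000],
})

# Bucket edges are an arbitrary choice for this sketch.
bins = [0, 10, 50, 100]
labels = ["0-10 MB", "10-50 MB", "50-100 MB"]
toy["Size_Bucket"] = pd.cut(toy["Size_MB"], bins=bins, labels=labels)

# The median is used because install counts are heavily skewed.
median_by_bucket = toy.groupby("Size_Bucket", observed=True)["Installs"].median()

print(median_by_bucket)
```

If the medians on the real data fall steadily from the smallest to the largest bucket, the claim holds; if they do not, the relationship is weaker than the scatter plot suggests.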

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights help developers optimize app size to improve install rates and reach a broader user base.
Larger app sizes indicate a risk of negative growth, as increased storage and data requirements can discourage downloads and lead to lower adoption.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

plt.figure(figsize=(8, 6))

sns.scatterplot(
    data=apps,
    x='Reviews',
    y='Rating',
    alpha=0.6
)

plt.title("Relationship Between Number of Reviews and App Rating")
plt.xlabel("Number of Reviews")
plt.ylabel("App Rating")
plt.xscale('log')  # Log scale to handle skewness in review counts
plt.show()

##### 1. Why did you pick the specific chart?

I chose a scatter plot because it clearly visualizes the relationship between the number of reviews and app ratings, both being numerical variables.
Using a log scale helps reveal patterns across a wide range of review counts without distortion from extreme values.

##### 2. What is/are the insight(s) found from the chart?

Apps with a higher number of reviews tend to have more stable and reliable ratings.
Apps with very few reviews show wide rating variation, indicating lower confidence in user feedback.
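
The claim that ratings stabilize as review counts grow can be checked by measuring the spread of ratings within each order of magnitude of review count. The following sketch uses hypothetical data, and the `Review_Magnitude` column is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration.
toy = pd.DataFrame({
    "Reviews": [3, 7, 9, 150, 400, 800, 20_000, 50_000, 90_000],
    "Rating": [1.0, 5.0, 3.0, 3.5, 4.5, 4.0, 4.2, 4.4, 4.3],
})

# log10 is undefined for zero, so apps without reviews are excluded first.
toy = toy[toy["Reviews"] > 0].copy()

# 0 means 1-9 reviews, 2 means 100-999 reviews, 4 means 10,000-99,999 reviews.
toy["Review_Magnitude"] = np.floor(np.log10(toy["Reviews"])).astype(int)

# Standard deviation of ratings within each magnitude group.
spread = toy.groupby("Review_Magnitude")["Rating"].std()

print(spread)
```

A standard deviation that shrinks as the magnitude increases would support the insight on the real data.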

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights encourage developers to increase user engagement and feedback, as more reviews lead to more trustworthy ratings and higher credibility.
Apps with very few reviews risk negative growth, since low review volume reduces user trust and can limit installs and visibility.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

plt.figure(figsize=(8, 5))

sns.histplot(
    apps['Reviews_per_Install'].dropna(),
    bins=30,
    kde=True,
    color='mediumseagreen'
)

plt.title("Distribution of Reviews per Install")
plt.xlabel("Reviews per Install")
plt.ylabel("Number of Apps")
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram because it effectively shows the distribution of engagement levels measured by reviews per install.
It helps identify how actively users provide feedback after installing an app.

##### 2. What is/are the insight(s) found from the chart?

Most apps have a very low reviews-per-install ratio, indicating that only a small fraction of users leave reviews.
A small subset of apps shows higher engagement, suggesting stronger user involvement or prompting strategies.
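
Since the histogram depends on how the ratio is built, here is one plausible way to compute it. This is a sketch under assumptions, not necessarily the definition used earlier in the notebook: apps with zero installs are given a missing ratio instead of an infinite one, and the data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the third app has no installs.
toy = pd.DataFrame({
    "Reviews": [10, 500, 0, 20],
    "Installs": [1_000, 100_000, 0, 50],
})

# Replace zero installs with NaN so the ratio is undefined rather than inf.
toy["Reviews_per_Install"] = toy["Reviews"] / toy["Installs"].replace(0, np.nan)

# Quantiles summarize a skewed distribution better than the mean.
quantiles = toy["Reviews_per_Install"].quantile([0.5, 0.9])

print(toy)
print(quantiles)
```

Quantiles such as the median and the 90th percentile make "very low" and "higher engagement" concrete when applied to the real column.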

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights help developers focus on improving in-app engagement and feedback prompts to increase reviews and visibility.
Apps with extremely low reviews per install indicate weak user engagement, which can negatively impact discoverability and long-term growth.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

plt.figure(figsize=(8, 6))

sentiment_by_type = (
    reviews.merge(apps[['App', 'Type']], on='App', how='inner')
           .groupby(['Type', 'Sentiment'])
           .size()
           .reset_index(name='Count')
)

sns.barplot(
    data=sentiment_by_type,
    x='Type',
    y='Count',
    hue='Sentiment'
)

plt.title("User Sentiment Distribution by App Type")
plt.xlabel("App Type")
plt.ylabel("Number of Reviews")
plt.legend(title="Sentiment")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a grouped bar chart because it clearly compares sentiment categories across free and paid apps.
It helps understand how user satisfaction differs by app monetization type.

##### 2. What is/are the insight(s) found from the chart?

Free apps tend to receive a higher volume of both positive and negative reviews, indicating wider usage and mixed user experiences.
Paid apps generally show a higher proportion of positive sentiment, suggesting better perceived quality and user satisfaction.
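
The chart plots raw counts, while the second insight is about proportions, and the two can differ when free apps have far more reviews overall. The sketch below shows one way to convert counts into within-type shares. The counts are made up, and `toy_counts` is a hypothetical stand-in for the `sentiment_by_type` table.

```python
import pandas as pd

# Hypothetical counts shaped like `sentiment_by_type`; values are made up.
toy_counts = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid", "Paid"],
    "Sentiment": ["Positive", "Negative", "Neutral"] * 2,
    "Count": [600, 300, 100, 70, 20, 10],
})

# Divide each count by the total number of reviews for its app type.
type_totals = toy_counts.groupby("Type")["Count"].transform("sum")
toy_counts["Share"] = toy_counts["Count"] / type_totals

print(toy_counts)
```

Plotting `Share` instead of `Count` on the y-axis would make free and paid apps directly comparable despite the difference in review volume.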

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insights help businesses improve monetization strategy by showing that paid apps often deliver higher user satisfaction and trust.
Free apps with a high share of negative sentiment indicate potential negative growth, as issues like ads or performance problems can drive poor reviews and user churn.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

numerical_features = [
    'Rating',
    'Installs',
    'Reviews',
    'Price',
    'Size_MB',
    'Log_Installs',
    'Avg_Sentiment_Polarity'
]

# Compute Spearman correlation (robust to skewed data)
correlation_matrix = apps[numerical_features].corr(method='spearman')

# Plot heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(
    correlation_matrix,
    annot=True,
    fmt=".2f",
    cmap="coolwarm",
    linewidths=0.5
)

plt.title("Correlation Heatmap of Key App Metrics")
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a correlation heatmap because it provides a clear, simultaneous view of relationships among multiple numerical variables.
It helps quickly identify which factors are most strongly associated with app performance and user engagement.

##### 2. What is/are the insight(s) found from the chart?

Ratings show a positive correlation with average sentiment polarity, matching the result from Chart 8, which uses the same variables and the same Spearman method.
Installs and reviews are also highly correlated, indicating that apps with more installs tend to accumulate more user feedback.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

# Select key numerical features for pairwise comparison
pairplot_features = [
    'Rating',
    'Log_Installs',
    'Reviews',
    'Price',
    'Size_MB',
    'Avg_Sentiment_Polarity'
]

# Create pair plot
sns.pairplot(
    apps[pairplot_features].dropna(),
    diag_kind='kde',
    corner=True
)

plt.suptitle("Pair Plot of Key App Performance Metrics", y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pair plot because it allows simultaneous exploration of pairwise relationships among multiple numerical variables.
It helps quickly identify trends, correlations, and potential outliers across key app performance metrics.

##### 2. What is/are the insight(s) found from the chart?

Clear positive relationships are visible between log installs and reviews, and between ratings and sentiment polarity, confirming consistency across metrics.
Outliers and non-linear patterns also emerge, indicating that multiple factors jointly influence app performance rather than a single variable.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

To achieve the business objective of improving app engagement and success on the Play Store, I recommend the following actions:

Focus on user experience and quality: Prioritize stability, performance, and intuitive design to maintain ratings above 4.0, which strongly influences installs and visibility.

Leverage user sentiment insights: Actively monitor reviews and sentiment to identify pain points early and resolve issues before they impact ratings and retention.

Optimize app size and performance: Keep app size minimal to reduce download friction and uninstall rates, especially in low-bandwidth markets.

Choose categories strategically: Target less saturated categories with strong demand to reduce competition and improve discoverability.

Drive user engagement: Encourage satisfied users to leave reviews, increasing credibility and improving store ranking.

# **Conclusion**

This project analyzed Google Play Store app data and user reviews to understand the key drivers of app engagement and success. Through systematic data wrangling, structured exploratory analysis, and insightful visualizations, clear patterns emerged around ratings, installs, categories, app size, and user sentiment.

The analysis shows that high app quality and positive user sentiment are the strongest contributors to success, as apps with better ratings and sentiment consistently achieve higher installs and engagement. Category selection and competition intensity play a critical role in growth potential, while app size and performance optimization significantly influence download behavior. Additionally, user engagement signals such as review volume and sentiment provide early indicators of future performance.

Overall, the insights derived from this analysis can help developers and businesses make data-driven decisions to improve product quality, optimize market positioning, and enhance user satisfaction, ultimately driving sustainable growth and long-term success on the Play Store.
