# **Project Name**    -

> Play Store App Review Analysis





##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

The Google Play Store is a dynamic ecosystem hosting millions of applications, each competing for user attention, high ratings, and long-term retention. For developers and stakeholders, success is often hidden within user feedback and performance metrics. The primary objective of this project is to perform a deep-dive Exploratory Data Analysis (EDA) on a Play Store dataset to identify the key factors that drive app popularity and user satisfaction.

#### Datasets Used

*   **Play Store Data:** Contains core attributes for thousands of apps, including Category, Rating, Reviews, Size, Installs, Type (Free/Paid), and Price.
*   **User Reviews:** Contains qualitative data such as Translated_Review, Sentiment (Positive, Negative, Neutral), Sentiment_Polarity, and Sentiment_Subjectivity.

#### Methodology & Steps

1.   Data Cleaning and Preprocessing
2.   Univariate Analysis
3.   Bivariate and Multivariate Analysis

#### Conclusion

This EDA project provides a comprehensive view of the Android app market. It suggests that while category and price type (Free) are strongly associated with installations, long-term success, as reflected in ratings and sentiment, depends on keeping the app regularly updated and responding to user feedback. App size showed only a weak relationship with ratings.

# **GitHub Link -**

“I uploaded my complete EDA project to GitHub, including datasets, notebooks, visualizations, and documentation. This makes the project version-controlled, easy to review, and reusable for future improvements.”

# **Problem Statement**


**Despite having a high volume of user reviews, the company currently lacks a systematic way to convert thousands of qualitative comments into actionable technical requirements. Without a data-driven approach to categorize and quantify user dissatisfaction, development teams are "guessing" which bugs to fix. This disconnect results in persistent negative sentiment, stagnant app ratings, and a high churn rate, preventing the app from reaching its growth potential.**

#### **Define Your Business Objective?**

"To optimize app performance and user satisfaction by quantifying the 'Sentiment-to-Rating' gap, specifically identifying the top 3 technical and monetization triggers—Ads, Bugs, and Pricing—that cause a decline in Play Store rankings."

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: To make plots look cleaner in Colab
%matplotlib inline

# Optional: To ignore warnings about deprecated functions
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
# pandas, numpy, matplotlib and seaborn are already imported in the cell above.

# 1. Additional visualization library
import plotly.express as px

# 2. Natural Language Processing (NLP)
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob  # For sentiment polarity and subjectivity

# 3. Text Cleaning & Regex
import re
import string

# 4. Load both CSV files so the cells below can use apps_df and reviews_df
#    (file names are the ones used in the Data Wrangling section)
apps_df = pd.read_csv('Play Store Data.csv')
reviews_df = pd.read_csv('User Reviews.csv')


### Dataset First View

In [None]:
# Dataset First Look
apps_df.head()



In [None]:
reviews_df.head()

### Dataset Rows & Columns count

In [None]:
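# Dataset Rows & Columns count
# print() is needed because a notebook cell only displays its last expression
print("Apps dataset (rows, columns):", apps_df.shape)
print("Reviews dataset (rows, columns):", reviews_df.shape)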


### Dataset Information

In [None]:
# Dataset Info
apps_df.info()
reviews_df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("Duplicate rows in apps dataset:", apps_df.duplicated().sum())
print("Duplicate rows in reviews dataset:", reviews_df.duplicated().sum())

#### Missing Values/Null Values

In [None]:
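# Missing Values/Null Values Count
print("Missing values per column (apps dataset):")
print(apps_df.isnull().sum())
print("\nMissing values per column (reviews dataset):")
print(reviews_df.isnull().sum())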


In [None]:
# Visualizing the missing values
#Missing Values Visualization for Play Store Apps Dataset
plt.figure(figsize=(10, 6))
sns.heatmap(apps_df.isnull(), cbar=False, yticklabels=False)
plt.title("Missing Values Heatmap – Play Store Apps Dataset")
plt.show()

#Missing Values Visualization for User Reviews Dataset
plt.figure(figsize=(10, 6))
sns.heatmap(reviews_df.isnull(), cbar=False, yticklabels=False)
plt.title("Missing Values Heatmap – User Reviews Dataset")
plt.show()



### What did you know about your dataset?

After performing an initial exploration of the dataset, I understood that it consists of two main parts: Play Store application data and user reviews data. The Play Store dataset contains information about individual apps such as category, rating, number of installs, price, type (free or paid), and content rating. This data helps in analyzing overall app performance and ranking behavior.

The user reviews dataset provides textual feedback along with sentiment labels and sentiment polarity scores. This dataset is useful for understanding user opinions and measuring user satisfaction beyond numerical ratings.

During exploration, I observed that some columns contain missing values, especially in ratings and review-related fields, and there were a few duplicate records that needed to be removed. I also noticed that most apps are free, with ratings generally concentrated between 4.0 and 4.5. However, sentiment polarity does not always align with app ratings, indicating a sentiment-to-rating gap.

Overall, the dataset is suitable for exploratory data analysis and allows meaningful insights into how technical issues like bugs, monetization factors such as ads and pricing, and user sentiment impact app ratings and Play Store rankings.
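
The missing-value observation above can be made concrete by reporting the share of missing entries per column rather than only the raw counts. The following is a minimal sketch: the helper `missing_summary` and the toy frame `toy_apps` are hypothetical names introduced here for illustration, and the toy values are made up. In the notebook the helper would be called on `apps_df` and `reviews_df`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    missing_count = df.isnull().sum()
    missing_pct = (missing_count / len(df) * 100).round(2)
    summary = pd.DataFrame(
        {"missing_count": missing_count, "missing_pct": missing_pct}
    )
    # Keep only columns that actually have gaps, worst first
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


# Toy frame standing in for apps_df (illustrative values only)
toy_apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Size": ["19M", "Varies with device", None, "8.7M"],
})
toy_missing = missing_summary(toy_apps)
print(toy_missing)
```

Note that a string such as 'Varies with device' is not counted as missing at this stage; it only becomes NaN after the Size column is cleaned.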

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print("Apps columns:", apps_df.columns.tolist())
print("Reviews columns:", reviews_df.columns.tolist())


In [None]:
# Dataset Describe
# display() shows both tables; a bare expression would only show the last one
display(apps_df.describe())
display(reviews_df.describe())

### Variables Description

The variables in the dataset represent both quantitative and qualitative aspects of app performance and user feedback. App-related variables provide insights into usage, pricing, and popularity, while review-related variables capture user sentiment and satisfaction levels.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

# Play Store apps dataset
for col in apps_df.columns:
    print(col, ":", apps_df[col].nunique())

# User reviews dataset
for col in reviews_df.columns:
    print(col, ":", reviews_df[col].nunique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
import pandas as pd
import numpy as np

# 1. LOAD DATASETS
apps_df = pd.read_csv('Play Store Data.csv')
reviews_df = pd.read_csv('User Reviews.csv')

# 2. CHECK COLUMN NAMES (VERY IMPORTANT)
print("Apps columns:", apps_df.columns.tolist())
print("Reviews columns:", reviews_df.columns.tolist())

# -------------------------------
# 3. CLEAN PLAY STORE DATA
# -------------------------------

# Remove duplicate apps safely
if 'App' in apps_df.columns:
    apps_df = apps_df.drop_duplicates(subset='App')

# Clean Installs
if 'Installs' in apps_df.columns:
    apps_df['Installs'] = (
        apps_df['Installs']
        .astype(str)
        .str.replace('+', '', regex=False)
        .str.replace(',', '', regex=False)
    )
    apps_df['Installs'] = pd.to_numeric(apps_df['Installs'], errors='coerce')

# Clean Price
if 'Price' in apps_df.columns:
    apps_df['Price'] = (
        apps_df['Price']
        .astype(str)
        .str.replace('$', '', regex=False)
    )
    apps_df['Price'] = pd.to_numeric(apps_df['Price'], errors='coerce')

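# Clean Size: convert everything to megabytes (MB)
def convert_size(size):
    """Convert a size string such as '19M' or '201k' to MB; return NaN if it cannot be parsed."""
    if isinstance(size, str):
        size = size.strip()
        try:
            if size.endswith('M'):
                return float(size[:-1])
            if size.endswith('k'):
                return float(size[:-1]) / 1024
        except ValueError:
            return np.nan
        # 'Varies with device' and any other unexpected text
        return np.nan
    return size

if 'Size' in apps_df.columns:
    # to_numeric guarantees a float column even if stray values slip through
    apps_df['Size'] = pd.to_numeric(apps_df['Size'].apply(convert_size), errors='coerce')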

# Convert Reviews
if 'Reviews' in apps_df.columns:
    apps_df['Reviews'] = pd.to_numeric(apps_df['Reviews'], errors='coerce')

# Fill missing Ratings
if 'Rating' in apps_df.columns:
    apps_df['Rating'] = pd.to_numeric(apps_df['Rating'], errors='coerce')
    apps_df['Rating'] = apps_df['Rating'].fillna(apps_df['Rating'].median())

# -------------------------------
# 4. CLEAN USER REVIEWS DATA
# -------------------------------

required_cols = ['App', 'Translated_Review', 'Sentiment', 'Sentiment_Polarity']
existing_cols = [c for c in required_cols if c in reviews_df.columns]

reviews_df = reviews_df.dropna(subset=existing_cols)

# -------------------------------
# 5. MERGE DATASETS
# -------------------------------

if 'App' in apps_df.columns and 'App' in reviews_df.columns:
    merged_df = pd.merge(apps_df, reviews_df, on='App', how='inner')
else:
    merged_df = apps_df.copy()

# -------------------------------
# 6. FINAL OUTPUT
# -------------------------------

print("\nData Wrangling Completed Successfully")
print("Apps Data Shape:", apps_df.shape)
print("Reviews Data Shape:", reviews_df.shape)
print("Merged Data Shape:", merged_df.shape)
print("\nMerged Data Sample:")
print(merged_df.head())


### What all manipulations have you done and insights you found?


**1. Data Manipulations / Cleaning Steps**

1.   **Removed duplicates:** Ensured each app is unique to avoid skewed analysis.
2.   **Cleaned Installs:** Removed `+` and commas, then converted to numeric for analysis.
3.   **Cleaned Price:** Removed `$` and converted it to float for calculations.
4.   **Standardized Size:** Converted all sizes to MB (k → MB, removed 'M') and set values like 'Varies with device' to NaN.
5.   **Converted Reviews and Rating to numeric:** Ensured numerical operations like average, correlation, and gap calculation are possible, and filled missing ratings with the median.
6.   **Cleaned user reviews:** Dropped rows with a missing app name, review text, sentiment label, or sentiment polarity.
7.   **Merged datasets:** Combined app details with corresponding user reviews for a comprehensive analysis.
8.   **Prepared new feature (optional for analysis):** A Sentiment-to-Rating Gap can be calculated to identify mismatches between user sentiment and app rating.
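
The optional Sentiment-to-Rating Gap feature is not computed in the wrangling code, so here is one possible way to define it. This is a sketch under a stated assumption, not the definition used in the analysis: the 1 to 5 star rating is mapped linearly onto [-1, 1] so that it is comparable with `Sentiment_Polarity`, and the gap is the mean polarity minus the rescaled rating. The function name `sentiment_rating_gap` and the toy frames are hypothetical.

```python
import pandas as pd


def sentiment_rating_gap(apps: pd.DataFrame, reviews: pd.DataFrame) -> pd.DataFrame:
    """Per-app gap between mean review polarity and the star rating.

    Assumption (not from the original analysis): Rating on a 1-5 scale is
    mapped linearly to [-1, 1] so it is comparable with Sentiment_Polarity.
    """
    avg_polarity = (
        reviews.groupby("App", as_index=False)["Sentiment_Polarity"].mean()
        .rename(columns={"Sentiment_Polarity": "Avg_Polarity"})
    )
    out = apps[["App", "Rating"]].merge(avg_polarity, on="App", how="inner")
    out["Rating_Scaled"] = (out["Rating"] - 3.0) / 2.0
    # Negative gap: reviews read worse than the star rating suggests
    out["Sentiment_Rating_Gap"] = out["Avg_Polarity"] - out["Rating_Scaled"]
    return out.sort_values("Sentiment_Rating_Gap").reset_index(drop=True)


# Toy frames standing in for apps_df and reviews_df (illustrative values only)
toy_apps = pd.DataFrame({"App": ["A", "B"], "Rating": [5.0, 3.0]})
toy_reviews = pd.DataFrame({
    "App": ["A", "A", "B", "B"],
    "Sentiment_Polarity": [0.2, 0.4, 0.5, 0.5],
})
gap_df = sentiment_rating_gap(toy_apps, toy_reviews)
print(gap_df)
```

Other scalings are equally defensible (for example, z-scoring both columns); the linear mapping is chosen here only because it is easy to read.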

**2. Insights from the Cleaned Data**

1.   **Install trends:** Some apps have millions of installs while others remain very low, indicating large differences in popularity.
2.   **Price vs Rating:** Paid apps generally have slightly lower downloads, but some high-priced apps have high ratings, suggesting that users recognize value for money.
3.   **App size variation:** Large apps don't always get higher ratings, which may suggest that size alone does not drive user satisfaction.
4.   **Sentiment vs Rating:** There can be gaps where the sentiment in reviews is lower than the rating, indicating hidden issues like bugs, annoying ads, or pricing complaints.
5.   **Data quality observation:** Some apps had missing or inconsistent data, emphasizing the importance of cleaning for accurate analysis.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code (Correlation Between App Features)
import matplotlib.pyplot as plt
import seaborn as sns

# --- STEP 1: ROBUST CLEANING (Fixing the ValueError) ---
# We force 'Installs' and 'Price' to numeric. 'coerce' turns any text errors into NaN.
apps_df['Installs'] = pd.to_numeric(apps_df['Installs'].astype(str).str.replace(r'[+,]', '', regex=True), errors='coerce')
apps_df['Price'] = pd.to_numeric(apps_df['Price'].astype(str).str.replace('$', '', regex=False), errors='coerce')
apps_df['Reviews'] = pd.to_numeric(apps_df['Reviews'], errors='coerce')

# --- STEP 2: PREPARE CORRELATION DATA ---
# We select only the numeric columns and drop rows with NaN so the correlation is accurate
numeric_df = apps_df[['Rating', 'Reviews', 'Size', 'Installs', 'Price']].dropna()

# --- STEP 3: VISUALIZATION ---
plt.figure(figsize=(10, 8))
sns.set_style('white')

# Create the heatmap
corr_matrix = numeric_df.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)

plt.title('Correlation Between App Features', fontsize=15, pad=20)
plt.show()

##### 1. Why did you pick the specific chart?

I chose the Correlation Heatmap for the following reasons:

**Holistic Overview:** Instead of looking at variables one by one, a heatmap allows us to visualize the relationships between all numeric variables (Rating, Reviews, Size, Installs, Price) simultaneously in a single view.

**Identifying Multi-collinearity:** It helps us identify which features move together. For instance, if Reviews and Installs are almost perfectly correlated, we know they provide similar information, which is vital for future predictive modeling.

**Pattern Detection:** The color-coded intensity (the `coolwarm` palette runs from blue for negative correlations to red for positive ones) makes it easy for stakeholders to spot "hidden" patterns. For example, it quickly reveals whether a higher Price is associated with a lower Rating or fewer Installs.

**Efficiency:** It serves as a "diagnostic" tool. By running this chart first, we can see which variables have the strongest relationships, which then tells us where to focus our more detailed charts (like Scatter plots or Bar charts) later.

##### 2. What is/are the insight(s) found from the chart?

**The Popularity Engine:** There is a strong positive correlation between Reviews and Installs. This suggests that the number of reviews is a primary driver (or a primary result) of an app's reach.

**Price Neutrality:** Price typically shows a very low correlation with Rating. This indicates that users do not necessarily give higher ratings just because they paid for an app; quality is independent of the price tag.

**Size vs. Performance:** Size usually shows a low correlation with Rating, suggesting that as long as the app works well, users aren't overly concerned with how many Megabytes it occupies.

**Rating Stability:** Rating often has a low correlation with Installs. This means that just because an app is famous (many installs) doesn't mean it is the highest quality (best rating).
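
Installs and Reviews span several orders of magnitude, so a Pearson correlation on raw values can be dominated by a handful of very large apps. As a cross-check on the heatmap, the sketch below compares Pearson on raw values, Pearson on `log1p`-transformed values, and Spearman rank correlation. The helper `compare_correlations` and the toy data are hypothetical and only illustrate the technique; in the notebook it could be applied to `numeric_df`.

```python
import numpy as np
import pandas as pd


def compare_correlations(df: pd.DataFrame, cols: list) -> dict:
    """Pearson (raw), Pearson (log1p), and Spearman correlation matrices."""
    data = df[cols].dropna()
    return {
        "pearson": data.corr(method="pearson"),
        # log1p assumes non-negative values, which holds for counts and prices
        "pearson_log1p": np.log1p(data).corr(method="pearson"),
        "spearman": data.corr(method="spearman"),
    }


# Toy data: Reviews grow with Installs, but not linearly (illustrative values only)
toy = pd.DataFrame({
    "Installs": [100, 1_000, 10_000, 1_000_000, 100_000_000],
    "Reviews": [3, 20, 150, 9_000, 400_000],
})
results = compare_correlations(toy, ["Installs", "Reviews"])
for name, matrix in results.items():
    print(name, round(matrix.loc["Installs", "Reviews"], 3))
```

If the three figures disagree sharply on the real data, the rank-based or log-scaled value is usually the safer one to quote for skewed columns.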

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**Positive Business Impact**

**Popularity Engine (Reviews → Installs):**
Knowing that more reviews are associated with higher installs can guide marketing and user engagement strategies.

Encouraging satisfied users to leave reviews can increase visibility and reach, leading to higher revenue and market share.

**Price Neutrality:**
Since price doesn’t strongly affect ratings, businesses can experiment with pricing strategies without worrying about lowering perceived app quality.

This can help optimize monetization while maintaining good user satisfaction.

**Size vs Performance:**
Low correlation between size and rating suggests apps can focus on adding features or improving performance rather than worrying about keeping the app extremely lightweight.

**Insights That Could Lead to Negative Growth**

**Rating Stability (Popularity ≠ Quality):**
High installs do not guarantee high ratings. Apps that gain popularity purely through marketing but have bugs, intrusive ads, or poor UX may see declining user satisfaction.

This can lead to churn, negative reviews, and ultimately hurt brand reputation.

**Justification:** For example, an app with millions of installs but frequent crashes or excessive ads may see negative sentiment in reviews, which could reduce long-term engagement and retention.

#### Chart - 2

In [None]:
# Chart - 2 visualization code (Top 10 Most Popular Categories (Total Installs))
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Group by Category and sum the Installs
# Sort to find the most popular categories and convert to billions to match the axis label
category_installs = (
    apps_df.groupby('Category')['Installs'].sum().sort_values(ascending=False).head(10) / 1e9
)

# 2. Create the Visualization
plt.figure(figsize=(12, 7))
sns.barplot(x=category_installs.values, y=category_installs.index, palette='viridis')

# 3. Add Labels and Title
plt.title('Top 10 Most Popular Categories (Total Installs)', fontsize=16, fontweight='bold')
plt.xlabel('Total Installs (in Billions)', fontsize=12)
plt.ylabel('Category', fontsize=12)

# Show plain numbers on the x-axis instead of scientific notation
plt.ticklabel_format(style='plain', axis='x')
plt.show()


##### 1. Why did you pick the specific chart?

I picked a horizontal bar chart because it clearly shows the top 10 most popular app categories by total installs. This type of chart makes it easy to compare categories at a glance and see which ones dominate in popularity.

Horizontal bars are especially useful when category names are long, so they remain readable.

Summing installs and visualizing the top 10 helps highlight key segments that drive the most user engagement.

The chart directly supports business insights, like which categories to focus marketing or development efforts on.

##### 2. What is/are the insight(s) found from the chart?

Insights from the Chart:

**Top Categories by Popularity:** Certain app categories, such as GAME, SOCIAL, and ENTERTAINMENT, dominate in total installs, indicating high user demand.

**Market Focus:** These categories attract the largest audience, so businesses can prioritize app development, marketing, or monetization strategies in these segments.

**Opportunity Identification:** Categories with lower installs may indicate niche markets or emerging opportunities where competition is lower.
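
Total installs can be high either because a category has many apps or because a few apps are extremely popular. A minimal sketch for separating the two is to report the app count and the median installs next to the total. The helper `category_summary` and the toy frame are hypothetical; the column names `Category` and `Installs` are the ones used in `apps_df`.

```python
import pandas as pd


def category_summary(apps: pd.DataFrame) -> pd.DataFrame:
    """Total installs, app count, and median installs per category."""
    summary = apps.groupby("Category")["Installs"].agg(
        total_installs="sum", app_count="count", median_installs="median"
    )
    return summary.sort_values("total_installs", ascending=False)


# Toy frame standing in for apps_df (illustrative values only)
toy_apps = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
    "Installs": [1_000_000, 500, 100, 100_000, 200_000],
})
toy_summary = category_summary(toy_apps)
print(toy_summary)
```

In the toy data, GAME leads on total installs only because of one large app, while the typical TOOLS app has far more installs; the same check on the real data shows whether a top category is broadly popular or carried by a few titles.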

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***Positive Business Impact***

**Focus on High-Demand Categories:** The chart shows which categories have the highest installs (e.g., GAME, SOCIAL, ENTERTAINMENT). Businesses can prioritize these categories to maximize reach, downloads, and revenue.

**Marketing & Monetization Strategy:** Understanding top categories helps in targeted marketing campaigns and deciding where to invest in ad spend or feature improvements.

**Resource Allocation:** Companies can allocate development resources to categories with proven high engagement, reducing risk and improving ROI.

***Potential Negative Growth Insights***

**High Competition in Top Categories:** Popular categories also mean intense competition. Launching a new app in these segments without differentiation may lead to low visibility and slow growth.

**Neglected Niche Categories:** Focusing only on top categories might ignore niche markets, missing opportunities for innovation or capturing less competitive segments.

***Justification:*** For example, entering a category like GAME, which is already crowded, could result in high marketing costs and poor adoption unless the app offers unique features. Conversely, niche categories with moderate installs may have less competition and more room for growth.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Identify the Top 10 categories by the number of reviews in the merged dataset
top_10_sentiment_cats = merged_df['Category'].value_counts().head(10).index
filtered_sentiment_df = merged_df[merged_df['Category'].isin(top_10_sentiment_cats)]

# 2. Create the Visualization
plt.figure(figsize=(14, 8))
sns.boxplot(x='Sentiment_Polarity', y='Category', data=filtered_sentiment_df, palette='Set3')

# 3. Add a vertical line at 0 (Neutral Sentiment) for reference
plt.axvline(0, color='red', linestyle='--', label='Neutral Threshold')

# 4. Add Labels and Title
plt.title('Chart 3: Sentiment Polarity Distribution by Category (Top 10)', fontsize=16, fontweight='bold')
plt.xlabel('Sentiment Polarity (Range: -1.0 to 1.0)', fontsize=12)
plt.ylabel('Category', fontsize=12)
plt.legend()
plt.grid(axis='x', linestyle='--', alpha=0.6)

plt.show()



##### 1. Why did you pick the specific chart?

I chose a boxplot of sentiment polarity by category because it clearly shows the distribution of user sentiment across the top app categories. This chart allows us to see:

Which categories generally receive positive, neutral, or negative feedback.

The spread of opinions, highlighting consistency or variability in user experience.

Outliers, which may indicate extreme positive or negative reviews.

Overall, it provides a direct way to analyze user satisfaction and identify categories that may need improvement or present business opportunities.

##### 2. What is/are the insight(s) found from the chart?

**Insights from the Chart:**

**Positive vs Negative Sentiment:** Some categories have mostly positive sentiment (median above 0), indicating good user satisfaction.

**Inconsistent User Experience:** Categories with a wide spread of polarity show mixed reviews—users have varied experiences with apps in that category.

**Problematic Categories:** Categories where many reviews fall below 0 may indicate issues such as bugs, intrusive ads, or pricing complaints.

**Outliers:** Extreme positive or negative reviews highlight apps that stand out, either performing exceptionally well or poorly.
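
The boxplot shows where reviews fall relative to zero, but "many reviews below 0" is easier to compare across categories as a number. The sketch below computes the share of negative-polarity reviews per category together with the median polarity. The helper `negative_share_by_category` and the toy frame are hypothetical; in the notebook the function would be applied to `merged_df`.

```python
import pandas as pd


def negative_share_by_category(merged: pd.DataFrame) -> pd.DataFrame:
    """Review count, median polarity, and share of negative reviews per category."""
    polarity = merged.dropna(subset=["Sentiment_Polarity"])
    grouped = polarity.groupby("Category")["Sentiment_Polarity"]
    summary = pd.DataFrame({
        "review_count": grouped.size(),
        "median_polarity": grouped.median(),
        # Fraction of reviews whose polarity is strictly below zero
        "negative_share": grouped.apply(lambda s: (s < 0).mean()),
    })
    return summary.sort_values("negative_share", ascending=False)


# Toy frame standing in for merged_df (illustrative values only)
toy_merged = pd.DataFrame({
    "Category": ["GAME"] * 4 + ["HEALTH"] * 3,
    "Sentiment_Polarity": [-0.5, -0.1, 0.3, 0.6, 0.2, 0.4, 0.0],
})
toy_negative = negative_share_by_category(toy_merged)
print(toy_negative)
```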

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***Positive Business Impact***

**Identify Strengths:** Categories with predominantly positive sentiment highlight successful apps and features, helping businesses replicate their strategies in other apps.

**Improve Engagement:** Understanding user sentiment allows developers to prioritize improvements, address complaints, and enhance user experience.

**Targeted Marketing:** Apps in categories with strong positive sentiment can be promoted confidently, leveraging their good reputation to attract more users.

***Potential Negative Growth Insights***

**Inconsistent or Negative Sentiment:** Categories with wide polarity ranges or many reviews below 0 indicate user dissatisfaction.

**Justification:** For example, a popular app with frequent negative reviews (bugs, ads, or pricing complaints) can damage brand reputation, reduce user retention, and slow growth despite high installs.

**Actionable Risk:** Ignoring these insights could result in losing users to competitors or receiving more negative reviews, ultimately impacting revenue and long-term success.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Chart 4: Rating vs Price (bubble plot with Installs as size)
plt.figure(figsize=(12, 8))
sns.scatterplot(
    data=apps_df,
    x='Price',
    y='Rating',
    size='Installs',
    hue='Category',
    palette='Set2',
    alpha=0.7,
    sizes=(50, 1000)
)

# Add labels and title
plt.title('Chart 4: App Rating vs Price (Bubble size = Installs)', fontsize=16, fontweight='bold')
plt.xlabel('Price ($)', fontsize=12)
plt.ylabel('Rating', fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title='Category')
plt.grid(True, linestyle='--', alpha=0.5)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a scatter/bubble plot of Rating vs Price with bubble size representing Installs because it allows us to see how app quality (ratings) relates to pricing and popularity.

It highlights whether higher-priced apps receive better ratings.

The bubble size shows which apps are most popular, helping identify high-performing apps in terms of both rating and downloads.

Using color for categories makes it easy to compare trends across different app types.

Overall, this chart helps link price, user satisfaction, and popularity, which is valuable for business strategy and pricing decisions.

##### 2. What is/are the insight(s) found from the chart?

**Insights from the Chart:**

**Price vs Rating:** Most high-rated apps are either free or low-priced, indicating that users don’t necessarily give better ratings to expensive apps.

**Popularity Trends:** Apps with the largest installs are usually free or low-cost, showing that affordable apps attract more users.

**Category Patterns:** Certain categories (e.g., GAME, SOCIAL) dominate both in popularity and high ratings, while others may have fewer installs despite good ratings.

**Outliers:** Some expensive apps achieve high ratings but low installs, suggesting a niche audience willing to pay for quality.
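
Because most apps are free, the points pile up at Price = 0 and the pattern is hard to read from the scatter alone. One way to back up these insights is to bin Price into bands and compare the median rating and installs per band. This is a sketch: the band edges and labels are assumptions chosen for illustration, and `price_band_summary` and the toy frame are hypothetical names.

```python
import numpy as np
import pandas as pd


def price_band_summary(apps: pd.DataFrame) -> pd.DataFrame:
    """Median rating and installs per price band (band edges are illustrative)."""
    bands = pd.cut(
        apps["Price"],
        bins=[-0.01, 0, 5, 20, np.inf],
        labels=["Free", "0.01-5", "5.01-20", "20+"],
    )
    return apps.groupby(bands, observed=True).agg(
        app_count=("Rating", "count"),  # apps with a non-missing rating
        median_rating=("Rating", "median"),
        median_installs=("Installs", "median"),
    )


# Toy frame standing in for apps_df (illustrative values only)
toy_apps = pd.DataFrame({
    "Price": [0.0, 0.0, 0.99, 4.99, 9.99, 399.99],
    "Rating": [4.2, 4.4, 4.5, 3.9, 4.7, 3.6],
    "Installs": [1_000_000, 5_000_000, 10_000, 5_000, 1_000, 100],
})
toy_bands = price_band_summary(toy_apps)
print(toy_bands)
```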

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***Positive Business Impact***

**Pricing Strategy:** Insights show that most high-rated apps are free or low-priced. Businesses can focus on affordable pricing to maximize downloads while maintaining quality.

**Category Focus:** Categories with high ratings and high installs (like GAME or SOCIAL) indicate profitable segments for marketing and development.

**Identify Premium Opportunities:** Expensive apps with high ratings but fewer installs show niche markets, which can be targeted with specialized campaigns.

***Potential Negative Growth Insights***

**High-Priced Apps Risk:** Apps that are expensive but have low installs may struggle to gain mass adoption, leading to slower growth.

**Justification:** Users are less likely to download high-priced apps unless they perceive strong value. Ignoring pricing trends could result in lost revenue and low market penetration.

**Low-Rated Popular Apps:** If a free or low-cost app has many installs but low ratings, it can harm brand reputation, reduce retention, and lead to negative reviews.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Chart - 5: Relationship between App "Freshness" (Update Year) and Average Rating

# 1. Prepare the data: Extract the Year from the 'Last Updated' column
# We use errors='coerce' to handle any remaining dirty data and .dt.year to get the year
apps_df['Last Updated'] = pd.to_datetime(apps_df['Last Updated'], errors='coerce')
apps_df['Update_Year'] = apps_df['Last Updated'].dt.year

# 2. Group by Update_Year and calculate the average rating
yearly_rating = apps_df.groupby('Update_Year')['Rating'].mean().reset_index()

# 3. Create the Visualization
plt.figure(figsize=(12, 6))
sns.lineplot(data=yearly_rating, x='Update_Year', y='Rating', marker='o', color='darkorange', linewidth=3)

# 4. Add Labels and Title
plt.title('Chart 5: Relationship between Last Update Year and Average Rating', fontsize=16, fontweight='bold')
plt.xlabel('Year the App was Last Updated', fontsize=12)
plt.ylabel('Average Rating', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.5)

# Ensure the X-axis shows whole years (integers)
plt.locator_params(axis='x', integer=True)

plt.show()



##### 1. Why did you pick the specific chart?

I chose a line chart because it shows how the average rating varies with the year in which an app was last updated. The update year is an ordered, time-based variable, and a line chart is well suited to observing trends across years. Note that the chart compares apps grouped by their last update year; it does not track how any single app's rating changed over time.

This chart helps understand whether regularly updated (fresh) apps maintain better ratings, which is important for evaluating the impact of maintenance, bug fixes, and feature updates on user satisfaction.

##### 2. What is/are the insight(s) found from the chart?

Apps that were updated more recently tend to have higher average ratings, indicating that users value regular updates and active maintenance.

Older apps that have not been updated for many years show relatively lower average ratings, suggesting performance issues, outdated features, or compatibility problems.

The overall trend shows a gradual improvement in ratings over recent years, which implies that continuous updates, bug fixes, and feature enhancements positively influence user satisfaction.
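
An average per year can be misleading when a year contains only a handful of apps, which is likely for the oldest update years. The sketch below reports the number of apps behind each yearly mean and flags years below a minimum count. The helper `rating_by_update_year`, the `min_apps` threshold, and the toy data are hypothetical; the `Last Updated` and `Rating` column names are those used in `apps_df`.

```python
import pandas as pd


def rating_by_update_year(apps: pd.DataFrame, min_apps: int = 2) -> pd.DataFrame:
    """Mean rating and app count per last-update year, with a reliability flag."""
    years = pd.to_datetime(apps["Last Updated"], errors="coerce").dt.year
    summary = (
        apps.assign(Update_Year=years)
        .dropna(subset=["Update_Year", "Rating"])
        .astype({"Update_Year": int})
        .groupby("Update_Year")["Rating"]
        .agg(mean_rating="mean", app_count="count")
    )
    # Flag years whose average rests on very few apps (threshold is an assumption)
    summary["reliable"] = summary["app_count"] >= min_apps
    return summary


# Toy frame standing in for apps_df (illustrative values only)
toy_apps = pd.DataFrame({
    "Last Updated": ["January 7, 2018", "March 3, 2018", "June 20, 2015", "not a date"],
    "Rating": [4.0, 4.4, 3.5, 4.9],
})
toy_years = rating_by_update_year(toy_apps)
print(toy_years)
```

For the real dataset a larger threshold than the default used here would be more appropriate; the value only needs to be high enough that one unusual app cannot move the yearly mean.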

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the gained insights can create a positive business impact.

The chart shows that apps updated more recently receive higher average ratings, which directly improves Play Store rankings, visibility, and download potential.

Regular updates help in fixing bugs, improving performance, and adding features, which increases user satisfaction and reduces negative reviews.

Higher ratings build user trust, leading to better retention, positive word of mouth, and increased revenue through ads or paid features.

***Insights Leading to Negative Growth:***

Apps that have not been updated for a long time show lower ratings, which can negatively impact growth.

Outdated apps may suffer from compatibility issues with newer Android versions, unresolved bugs, and security risks, causing users to leave negative reviews.

Lower ratings reduce search ranking and install rates, leading to declining user base and revenue loss.

**Justification:**

This clearly indicates that ignoring regular updates leads to negative growth, while maintaining app freshness supports sustained business success.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Prepare data: calculate average sentiment per app
avg_sentiment = reviews_df.groupby('App')['Sentiment_Polarity'].mean().reset_index()

# 2. Merge with app ratings
sentiment_rating_df = pd.merge(
    apps_df[['App', 'Rating']],
    avg_sentiment,
    on='App',
    how='inner'
)

# 3. Create the visualization
plt.figure(figsize=(12, 7))
sns.scatterplot(
    data=sentiment_rating_df,
    x='Sentiment_Polarity',
    y='Rating',
    size='Rating',
    hue='Rating',
    palette='viridis',
    sizes=(50, 300),
    alpha=0.8
)

# 4. Add reference line for neutral sentiment
plt.axvline(0, color='red', linestyle='--', label='Neutral Sentiment')

# 5. Labels and title
plt.title('Chart 6: Sentiment-to-Rating Gap Analysis', fontsize=16, fontweight='bold')
plt.xlabel('Average Sentiment Polarity', fontsize=12)
plt.ylabel('App Rating', fontsize=12)
plt.legend()
plt.grid(True, linestyle='--', alpha=0.5)

plt.show()


##### 1. Why did you pick the specific chart?

I picked this chart because it shows the relationship between user sentiment and app ratings, which is the core focus of the project. A scatter plot is well suited here because it compares two continuous variables at once and makes it easy to spot gaps where sentiment and ratings do not align. It allows quick identification of apps that receive positive feedback but still have low ratings, or apps that keep high ratings despite negative review sentiment. These mismatches point to potential issues like bugs, excessive ads, or pricing dissatisfaction that affect app performance and rankings.

##### 2. What is/are the insight(s) found from the chart?

**1.Sentiment-to-Rating Gaps:** Some apps have positive sentiment but lower ratings, indicating users like the app but may be frustrated by specific issues such as ads or pricing.

**2.Negative Sentiment and Low Ratings:** Apps with negative sentiment tend to have lower ratings, which is consistent with bugs, crashes, or poor UX hurting user perception, although the chart alone does not show the cause.

**3.Neutral Sentiment Apps:** Apps around neutral polarity often have average ratings, suggesting mixed experiences among users.

**4.Business Focus Areas:** The chart helps identify top apps needing improvement in technical performance or monetization strategy to close the gap between sentiment and rating.
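
One way to put a number on the gaps described above is to standardize both measures and subtract them. The sketch below is only an illustration: the toy data, the `gap` column, and the helper name `sentiment_rating_gap` are assumptions and not part of the notebook. A large positive gap marks an app whose sentiment is better than its rating would suggest.

```python
import pandas as pd

def sentiment_rating_gap(df: pd.DataFrame) -> pd.DataFrame:
    """Return df with z-scored columns and gap = z(sentiment) - z(rating)."""
    out = df.copy()
    for col in ("Sentiment_Polarity", "Rating"):
        # Population standard deviation (ddof=0) keeps the toy numbers simple.
        out[f"z_{col}"] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    out["gap"] = out["z_Sentiment_Polarity"] - out["z_Rating"]
    # Largest gap first: positive sentiment that is not reflected in the rating.
    return out.sort_values("gap", ascending=False).reset_index(drop=True)

# Hypothetical apps, shaped like the merged sentiment_rating_df.
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Sentiment_Polarity": [0.40, 0.35, -0.10, 0.05],
    "Rating": [4.6, 3.2, 3.0, 4.4],
})
ranked = sentiment_rating_gap(toy)
print(ranked[["App", "gap"]])
```

Applied to `sentiment_rating_df`, the top rows of such a ranking would be the candidates for a closer look at ads, pricing, or stability complaints.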

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***Positive Business Impact***

**1.Identify Improvement Areas:** The chart highlights apps where sentiment is positive but ratings are low. Businesses can focus on fixing technical issues, optimizing ads, or adjusting pricing to convert positive sentiment into higher ratings.

**2.Enhance User Satisfaction:** Understanding the Sentiment-to-Rating gap helps developers prioritize features and improvements that matter most to users, boosting engagement and retention.

**3.Strategic Marketing:** Apps with aligned high sentiment and high ratings can be promoted confidently, strengthening brand reputation and download rates.

***Insights Leading to Negative Growth***

**1.Technical or Monetization Issues:** Apps with negative sentiment and low ratings indicate problems such as bugs, intrusive ads, or poor pricing.

**Justification:** Ignoring these apps can result in declining downloads, negative reviews, and lower Play Store rankings, ultimately reducing revenue and long-term growth.

**Actionable Risk:** Addressing these gaps is essential; otherwise, even popular apps may lose users and market share despite high install numbers.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
import matplotlib.pyplot as plt
import seaborn as sns

# Filter out free apps with price = 0 for better visibility of paid apps
paid_apps_df = apps_df[apps_df['Price'] > 0]

# Scatter plot: Price vs Rating with bubble size = Installs
plt.figure(figsize=(12, 8))
sns.scatterplot(
    data=paid_apps_df,
    x='Price',
    y='Rating',
    size='Installs',
    hue='Category',
    palette='Set2',
    sizes=(50, 1000),
    alpha=0.7
)

# Add labels and title
plt.title('Chart 7: Price vs Rating vs Installs by Category', fontsize=16, fontweight='bold')
plt.xlabel('Price ($)', fontsize=12)
plt.ylabel('Rating', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title='Category')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I picked a bubble scatter plot of Price vs Rating with bubble size representing Installs because it clearly shows how app pricing affects user perception and popularity. The chart allows us to see:

Which categories perform well at higher prices

How ratings vary with price

Which apps are popular despite higher costs

##### 2. What is/are the insight(s) found from the chart?

Insights from the Chart:

Most high-rated paid apps sit at the low end of the price range (free apps are filtered out of this chart), suggesting that users favor affordable apps.

Paid apps with high prices often have lower installs, indicating that cost can be a barrier to adoption.

Some high-priced apps achieve good ratings but limited downloads, showing a niche audience willing to pay for quality or specialized features.

Category trends: Certain categories, like GAME or EDUCATION, perform well even at higher prices, while others struggle, highlighting which segments are more price-sensitive.
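
To check these price observations against numbers rather than bubble sizes, the paid apps can be grouped into price tiers. This is a minimal sketch on made-up data; the tier edges and labels are assumptions chosen for illustration, and `toy_paid` stands in for `paid_apps_df`.

```python
import pandas as pd

# Hypothetical paid apps.
toy_paid = pd.DataFrame({
    "App": ["P1", "P2", "P3", "P4", "P5", "P6"],
    "Price": [0.99, 1.99, 4.99, 9.99, 24.99, 79.99],
    "Rating": [4.5, 4.2, 4.4, 3.9, 4.6, 3.5],
    "Installs": [500_000, 100_000, 50_000, 10_000, 1_000, 100],
})

# Assumed tier edges; bins are right-inclusive: (0, 2], (2, 10], (10, inf)
tiers = pd.cut(
    toy_paid["Price"],
    bins=[0, 2, 10, float("inf")],
    labels=["under $2", "$2 to $10", "over $10"],
)

# Medians are less affected by a single very popular app than means.
tier_summary = toy_paid.groupby(tiers, observed=True).agg(
    apps=("App", "count"),
    median_rating=("Rating", "median"),
    median_installs=("Installs", "median"),
)
print(tier_summary)
```

If median installs fall from tier to tier while median ratings stay roughly flat, that would support the reading that price limits adoption more than it affects satisfaction.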

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***Positive Business Impact***

**Optimized Pricing Strategy:** Insights show that most high-rated paid apps are low-priced. Businesses can focus on affordable pricing to maximize downloads while maintaining quality.

**Targeted Niche Marketing:** High-priced apps with good ratings but fewer installs indicate niche audiences. Companies can target these users with premium marketing and features, increasing revenue from willing buyers.

**Category-Specific Decisions:** Understanding which categories tolerate higher prices helps develop category-wise monetization strategies, improving ROI.

***Insights Leading to Negative Growth***


**1.High Price Barrier:** Paid apps with high prices and low installs show that cost can limit user adoption.

**Justification:** If pricing is not optimized, even high-quality apps may struggle to gain traction, leading to reduced downloads, lower revenue, and slower market growth.

**2.Category Sensitivity:** Certain categories are highly price-sensitive; ignoring this may result in lost users and negative impact on brand reputation.

#### Chart - 8

In [None]:
# Chart - 8 visualization code (Top 10 Apps by Number of Reviews)
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Identify top 10 apps by reviews
top_10_reviews = apps_df[['App', 'Reviews']].sort_values(by='Reviews', ascending=False).head(10)

# 2. Plot
plt.figure(figsize=(12, 7))
sns.barplot(
    x='Reviews',
    y='App',
    data=top_10_reviews,
    palette='magma'
)

# 3. Add labels and title
plt.title('Top 10 Apps by Number of Reviews', fontsize=16, fontweight='bold')
plt.xlabel('Number of Reviews', fontsize=12)
plt.ylabel('App Name', fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.5)

plt.show()


##### 1. Why did you pick the specific chart?

I picked a horizontal bar chart of the top 10 apps by number of reviews because it clearly highlights which apps have the highest user engagement. The horizontal layout makes it easy to read long app names, and the bar lengths show relative review counts at a glance.

This chart helps quickly identify popular apps that attract the most user attention, which is valuable for understanding market trends, user preferences, and potential drivers of installs and revenue.

##### 2. What is/are the insight(s) found from the chart?

**Insights from the Chart:**

**1.Top apps attract the most user engagement:** The apps with the highest number of reviews are likely the most popular and widely used.

**2.High review counts don’t always mean the highest ratings:** Some apps with many reviews may have average or slightly lower ratings, so popularity and quality can differ. Ratings are not shown in this chart, so this needs a check against the `Rating` column.

**3.Engagement patterns by category:** Looking up the categories of these ten apps can show which app types attract the most feedback; the chart itself does not display categories.

**4.Opportunity for improvement:** Apps with fewer reviews in the same category might need marketing or user engagement strategies to increase visibility and feedback.
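
A top-10 ranking is only reliable if `Reviews` is numeric and each app appears once. If the column is still text, sorting is lexicographic and `'900'` ranks above `'1000'`. The sketch below shows one defensive way to build the ranking; the helper name `top_apps_by_reviews` and the toy rows are assumptions, and the notebook's earlier cleaning steps may already cover this.

```python
import pandas as pd

def top_apps_by_reviews(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n most-reviewed apps, one row per app."""
    cleaned = df.copy()
    # Coerce review counts to numbers; unparseable entries become NaN.
    cleaned["Reviews"] = pd.to_numeric(cleaned["Reviews"], errors="coerce")
    return (
        cleaned.dropna(subset=["Reviews"])
        .sort_values("Reviews", ascending=False)
        .drop_duplicates(subset="App", keep="first")  # keep the largest count per app
        .head(n)
        .reset_index(drop=True)
    )

# Hypothetical rows with a duplicate app and one unparseable count.
toy = pd.DataFrame({
    "App": ["Chat", "Chat", "Maps", "Notes", "Game"],
    "Reviews": ["900", "1000", "250", "3.0M", "75"],
})
top3 = top_apps_by_reviews(toy, n=3)
print(top3)
```

Without the de-duplication step, an app listed twice would take two of the ten bars.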

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***Positive Business Impact***

**Identify Popular Apps:** The chart shows which apps receive the most reviews, helping businesses understand which products are highly engaging and potentially profitable.

**Focus on High-Engagement Strategies:**
Insights from top-reviewed apps can guide marketing campaigns, feature improvements, or promotional strategies for similar apps to increase user interaction.

**Category Benchmarking:**
Businesses can compare apps within the same category to set engagement targets and improve app performance metrics.

***Insights Leading to Negative Growth***

**Low-Engagement Apps:** Apps with fewer reviews, even if they have good ratings, may indicate low user interaction or visibility issues.

**Justification:** Ignoring these apps can result in stagnant growth, reduced installs, and missed revenue opportunities, especially if users are not leaving feedback to guide improvements.

**Actionable Risk:** Without increasing engagement or soliciting user feedback, these apps may fall behind competitors, negatively impacting long-term growth.

#### Chart - 9

In [None]:
# Chart - 9 visualization code (App Rating vs Number of Installs by Category)
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Step 1: Clean the 'Installs' column

def clean_installs(installs):
    """Convert values such as '1,000+' to 1000.0; return NaN if parsing fails."""
    try:
        if isinstance(installs, str):
            installs = installs.replace(',', '').replace('+', '').strip()
        return float(installs)
    except (ValueError, TypeError):
        return np.nan

apps_df['Installs'] = apps_df['Installs'].apply(clean_installs)

# Clean 'Rating'
apps_df['Rating'] = pd.to_numeric(apps_df['Rating'], errors='coerce')

# Drop missing values (explicit copy, so adding a column below does not
# trigger a SettingWithCopyWarning)
installs_rating_df = apps_df.dropna(subset=['Installs', 'Rating']).copy()

# Optional: log-transform Installs for better visualization
installs_rating_df['Log_Installs'] = np.log1p(installs_rating_df['Installs'])

# ----------------------
# Step 2: Plot
# ----------------------
plt.figure(figsize=(12, 7))
sns.scatterplot(
    data=installs_rating_df,
    x='Log_Installs',
    y='Rating',
    hue='Category',
    palette='tab10',
    alpha=0.7,
    s=100
)

plt.title('App Rating vs Number of Installs by Category', fontsize=16, fontweight='bold')
plt.xlabel('Log(Number of Installs)', fontsize=12)
plt.ylabel('Rating', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title='Category')
plt.tight_layout()
plt.show()




##### 1. Why did you pick the specific chart?

I chose a scatter plot of App Rating vs Number of Installs because it effectively shows the relationship between popularity and user satisfaction. The log scale for installs makes it easier to visualize apps with very high or very low installs. This chart helps identify apps that are widely used but may have lower ratings, as well as apps that are less popular but highly rated, giving insight into potential improvements and growth opportunities across categories.
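
The transform used for the x-axis is `log1p`, the natural log of one plus the value, so it is defined for apps with zero installs. A small worked example with made-up install counts shows how much it compresses the range:

```python
import math

# Hypothetical install counts spanning nine orders of magnitude.
install_counts = [0, 10, 1_000, 1_000_000, 1_000_000_000]

# log1p(x) = ln(1 + x): defined at 0, and large values are pulled together.
log_installs = [math.log1p(x) for x in install_counts]

for raw, logged in zip(install_counts, log_installs):
    print(f"{raw:>13,} -> {logged:6.2f}")
```

On the raw scale the largest app would push every other point against the left edge of the plot; on the log scale the same apps spread across the axis.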

##### 2. What is/are the insight(s) found from the chart?

**1.High installs don’t always mean high ratings:** Some apps with a large number of installs have average or slightly lower ratings, indicating that popularity does not guarantee user satisfaction.

**2.Low installs but high ratings:** Certain niche apps have fewer installs but maintain very high ratings, showing strong satisfaction among a smaller audience.

**3.Category patterns:** Some categories, like Games or Education, tend to achieve both high installs and high ratings, while others may struggle to balance popularity and quality.

**4.Opportunities for improvement:** Popular apps with lower ratings could be optimized for performance, UX, or feature improvements to maintain user satisfaction and reduce negative feedback.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***Positive Business Impact***

**1.Identify Optimization Opportunities:** Apps with high installs but lower ratings highlight areas where performance, UX, or features can be improved, increasing user satisfaction and retention.

**2.Leverage Niche Success:** Apps with fewer installs but high ratings show potential for targeted marketing, helping attract a larger audience without sacrificing quality.

**3.Category Strategy:** Understanding which categories consistently achieve high installs and high ratings can guide investment, feature development, and promotional focus, improving ROI.

***Insights Leading to Negative Growth***

**Popular apps with low ratings:** If not addressed, these apps may see negative reviews, churn, and declining installs, harming the brand and revenue.

**Justification:** High install numbers alone cannot sustain growth; poor ratings impact Play Store rankings and user trust, potentially leading to long-term revenue loss.

**Category-specific risk:** Some categories may be highly competitive; ignoring rating issues in these categories can result in losing market share to competitors.

#### Chart - 10

In [None]:
# Chart - 10 visualization code (Sentiment Polarity Distribution for Top 10 Most Reviewed Apps)
import matplotlib.pyplot as plt
import seaborn as sns

# ----------------------
# Step 1: Identify top 10 apps by reviews
# ----------------------
top_10_apps = apps_df[['App', 'Reviews']].sort_values(by='Reviews', ascending=False).head(10)['App']

# Filter reviews for these apps
top_reviews_df = reviews_df[reviews_df['App'].isin(top_10_apps)]

# ----------------------
# Step 2: Plot boxplot of Sentiment Polarity by App
# ----------------------
plt.figure(figsize=(14, 8))
sns.boxplot(
    x='Sentiment_Polarity',
    y='App',
    data=top_reviews_df,
    palette='Set2'
)

# Add vertical line for neutral sentiment
plt.axvline(0, color='red', linestyle='--', label='Neutral Threshold')

# Labels and title
plt.title('Sentiment Polarity Distribution for Top 10 Most Reviewed Apps',
          fontsize=16, fontweight='bold')
plt.xlabel('Sentiment Polarity (-1 to 1)', fontsize=12)
plt.ylabel('App', fontsize=12)
plt.legend()
plt.grid(axis='x', linestyle='--', alpha=0.5)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a boxplot of Sentiment Polarity for the top 10 most reviewed apps because it effectively shows the distribution of user opinions for the most popular apps. The boxplot highlights positive, neutral, and negative sentiment clearly, making it easy to spot apps where user satisfaction is inconsistent. This chart helps identify which high-engagement apps may have issues affecting their ratings or rankings, supporting data-driven decisions to improve app performance and user experience.

##### 2. What is/are the insight(s) found from the chart?

**1.Most top-reviewed apps maintain positive sentiment:** The median sentiment for most apps is above zero, showing general user satisfaction.

**2.Some apps show wide sentiment variability:** Certain apps have both highly positive and negative reviews, indicating mixed experiences due to bugs, ads, or pricing issues.

**3.Neutral or negative outliers:** A few negative sentiment reviews can significantly affect app ratings despite overall popularity.

**4.Engagement vs satisfaction gap:** Even popular apps with many reviews may have hidden user dissatisfaction, which can be addressed to improve retention and ratings.
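
The boxplot reading above can be backed by a small per-app table: the median (the line in the box), the interquartile range (the width of the box), and the share of reviews below zero. This is a sketch on invented polarity values; the helper names `iqr` and `negative_share` are assumptions.

```python
import pandas as pd

# Hypothetical review-level polarity scores for two apps.
toy_reviews = pd.DataFrame({
    "App": ["X"] * 5 + ["Y"] * 5,
    "Sentiment_Polarity": [0.5, 0.6, 0.4, 0.5, 0.7,      # consistently positive
                           0.9, -0.8, 0.1, -0.5, 0.8],   # polarized
})

def iqr(s: pd.Series) -> float:
    """Interquartile range: the width of the box in a boxplot."""
    return s.quantile(0.75) - s.quantile(0.25)

def negative_share(s: pd.Series) -> float:
    """Fraction of reviews with polarity below zero."""
    return (s < 0).mean()

summary = toy_reviews.groupby("App")["Sentiment_Polarity"].agg(
    median="median",
    iqr=iqr,
    negative_share=negative_share,
)
print(summary)
```

On the real `top_reviews_df`, rows with a missing `Sentiment_Polarity` should be dropped first, since they would otherwise count toward the denominator of `negative_share`.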

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***Positive Business Impact***

**1.Targeted Improvements:** The chart highlights top-reviewed apps with mixed or negative sentiment, helping developers focus on fixing bugs, optimizing features, or adjusting monetization strategies.

**2.User Retention & Satisfaction:** Understanding sentiment distribution allows businesses to enhance user experience, converting neutral or dissatisfied users into loyal customers.

**3.Strategic Marketing:** Apps with consistently positive sentiment can be promoted confidently, boosting downloads, engagement, and revenue.

***Insights Leading to Negative Growth***

**1.Mixed or Negative Sentiment:** Even highly reviewed apps can have negative sentiment outliers, signaling unresolved issues.

**Justification:** Ignoring these concerns can lead to declining ratings, negative reviews, and reduced Play Store rankings, potentially decreasing installs and revenue over time.

**Actionable Risk:** Consistently monitoring sentiment is essential; failing to do so may cause loss of user trust and market share, despite the app’s popularity.

#### Chart - 11

In [None]:
# Chart - 11 visualization code (Average Rating vs Average Price by Category)
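import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Step 1: Clean the 'Price' column

# regex=False makes the literal match explicit ('$' is also a regex anchor)
apps_df['Price'] = apps_df['Price'].astype(str).str.replace('$', '', regex=False).str.strip()
apps_df['Price'] = pd.to_numeric(apps_df['Price'], errors='coerce')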

# Step 2: Calculate average Rating and Price per Category

category_stats = apps_df.groupby('Category')[['Rating', 'Price']].mean().reset_index()

# Step 3: Plot
plt.figure(figsize=(12, 7))
sns.scatterplot(
    data=category_stats,
    x='Price',
    y='Rating',
    size='Rating',          # Bubble size by Rating
    hue='Category',
    palette='tab20',
    sizes=(100, 500),
    alpha=0.8
)

# Add labels and title
plt.title('Average Rating vs Average Price by Category', fontsize=16, fontweight='bold')
plt.xlabel('Average Price ($)', fontsize=12)
plt.ylabel('Average Rating', fontsize=12)
plt.grid(True, linestyle='--', alpha=0.5)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', title='Category')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I picked a scatter plot of Average Rating vs Average Price by Category because it clearly shows how pricing affects user satisfaction across different app categories. Using bubble size to represent average rating makes it easy to spot which categories perform well even at higher prices, and which are more price-sensitive. This chart helps identify monetization opportunities and risks while comparing category performance in terms of both user satisfaction and pricing strategy.

##### 2. What is/are the insight(s) found from the chart?

**1.High-rated categories at higher prices:** Certain categories, like Games or Education, maintain high average ratings even with higher prices, indicating users are willing to pay for quality.

**2.Price-sensitive categories:** Some categories show lower ratings as average price increases, suggesting that users in these segments prefer free or low-cost apps.

**3.Opportunities for monetization:** Categories with low average prices but high ratings may have potential to introduce premium features or in-app purchases without harming user satisfaction.

**4.Category performance comparison:** The chart helps identify which categories are performing well both in user satisfaction (rating) and pricing strategy, and which may need adjustments.
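
One caveat when reading average price per category: the mean is taken over all apps, so a category with many free apps shows a low average even if its paid apps are expensive. The sketch below separates the two effects on made-up data; the column names `mean_price_all`, `paid_share`, and `mean_price_paid` are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical apps: price 0.0 means free.
toy_apps = pd.DataFrame({
    "Category": ["TOOLS"] * 4 + ["MEDICAL"] * 4,
    "Price": [0.0, 0.0, 0.0, 3.99, 0.0, 0.0, 9.99, 19.99],
})

def paid_share(prices: pd.Series) -> float:
    """Fraction of apps in the group that cost more than zero."""
    return (prices > 0).mean()

def mean_paid_price(prices: pd.Series) -> float:
    """Average price over paid apps only."""
    return prices[prices > 0].mean()

by_category = toy_apps.groupby("Category")["Price"].agg(
    mean_price_all="mean",          # what the chart plots: free apps included
    paid_share=paid_share,
    mean_price_paid=mean_paid_price,
)
print(by_category)
```

Comparing `mean_price_all` with `mean_price_paid` shows whether a category's position on the x-axis reflects expensive apps or simply a higher share of paid apps.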

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***Positive Business Impact***

**1.Optimized Pricing Strategies:** By identifying categories that maintain high ratings at higher prices, businesses can set appropriate price points to maximize revenue without affecting user satisfaction.

**2.Monetization Opportunities:** Categories with high ratings but low prices can explore premium features or in-app purchases, increasing profitability.

**3.Category Benchmarking:** Comparing average ratings and prices across categories helps guide investment, marketing, and development priorities, improving ROI.

***Insights Leading to Negative Growth***

**1.Price-sensitive categories:** Some categories show lower ratings as price increases. Charging too much in these segments can reduce installs and user satisfaction, harming long-term growth.

**Justification:** Ignoring user price sensitivity may result in negative reviews, lower ratings, and decreased Play Store rankings, reducing app visibility and revenue potential.

**Actionable Risk:** Businesses need to balance pricing with perceived value, especially in categories where users expect free or low-cost apps.

#### Chart - 12

In [None]:
# Chart - 12 visualization code (Estimated Revenue by App Category)
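import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd


# Step 1: Clean the 'Price' and 'Installs' columns
# regex=False makes the literal match explicit ('$' is also a regex anchor)
apps_df['Price'] = apps_df['Price'].astype(str).str.replace('$', '', regex=False).str.strip()
apps_df['Price'] = pd.to_numeric(apps_df['Price'], errors='coerce')

def clean_installs(installs):
    """Convert values such as '1,000+' to 1000.0; return NaN if parsing fails."""
    try:
        if isinstance(installs, str):
            installs = installs.replace(',', '').replace('+', '').strip()
        return float(installs)
    except (ValueError, TypeError):
        return np.nan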

apps_df['Installs'] = apps_df['Installs'].apply(clean_installs)


# Step 2: Calculate Estimated Revenue per App (Price * Installs)
apps_df['Estimated_Revenue'] = apps_df['Price'] * apps_df['Installs']


# Step 3: Aggregate Revenue by Category
category_revenue = apps_df.groupby('Category')['Estimated_Revenue'].sum().sort_values(ascending=False).reset_index()


# Step 4: Plot
plt.figure(figsize=(12, 7))
sns.barplot(
    x='Estimated_Revenue',
    y='Category',
    data=category_revenue,
    palette='coolwarm'
)

# Add labels and title
plt.title('Estimated Revenue by App Category', fontsize=16, fontweight='bold')
plt.xlabel('Estimated Revenue ($)', fontsize=12)
plt.ylabel('Category', fontsize=12)
plt.grid(axis='x', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a horizontal bar chart of estimated revenue by app category because it clearly shows which categories generate the most revenue by combining price and number of installs. This visualization helps quickly identify high-revenue segments, allowing businesses to prioritize development, marketing, and monetization strategies. The horizontal layout makes category names easier to read and the bar lengths immediately convey relative revenue contributions.

##### 2. What is/are the insight(s) found from the chart?

**1.Top revenue-generating categories:** Certain categories, like Games, Education, or Productivity, often generate the highest revenue due to a combination of higher prices and large numbers of installs.

**2.Low-priced categories may lag in revenue:** Some categories with many installs but low prices contribute less to overall revenue, showing that high download volume alone doesn’t guarantee high earnings.

**3.Opportunity for monetization:** Categories with moderate installs but higher prices may have untapped revenue potential if marketed effectively.

**4.Strategic focus areas:** The chart helps identify which categories deserve more investment or premium features to maximize revenue growth.
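
The revenue figure behind this chart is simply price multiplied by installs, summed per category. A worked example on hypothetical rows makes the arithmetic and its limits concrete. The helper names `parse_price` and `parse_installs` are assumptions; the notebook does the same cleaning with pandas.

```python
def parse_installs(raw: str) -> float:
    """Turn an install bucket such as '10,000+' into 10000.0."""
    return float(raw.replace(",", "").replace("+", "").strip())

def parse_price(raw: str) -> float:
    """Turn a price string such as '$4.99' (or '0' for free apps) into a float."""
    return float(raw.replace("$", "").strip())

# Hypothetical rows: (category, price, installs)
rows = [
    ("GAME", "$4.99", "100,000+"),
    ("GAME", "0", "50,000,000+"),     # free app: contributes nothing to the estimate
    ("MEDICAL", "$24.99", "10,000+"),
]

revenue_by_category = {}
for category, price, installs in rows:
    estimate = parse_price(price) * parse_installs(installs)
    revenue_by_category[category] = revenue_by_category.get(category, 0.0) + estimate

print(revenue_by_category)
```

Two limits follow from the formula. Free apps always contribute zero, so revenue from ads or in-app purchases is invisible to this estimate. The trailing `+` also means each install value is a bucket threshold, so the result is better read as a rough floor than as actual revenue.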

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***Positive Business Impact***

**1.Revenue Optimization:** The chart identifies the highest revenue-generating categories, allowing businesses to focus development, marketing, and monetization efforts where they will have the most financial impact.

**2.Strategic Investments:** Categories with moderate installs but higher prices can be promoted or enhanced to increase revenue potential.

**3.Pricing Strategy:** Insights can guide adjustments in pricing or premium features for categories with high install volumes but low revenue, boosting overall profitability.

***Insights Leading to Negative Growth***

**1.Low-revenue categories:** Some categories may have many installs but generate little revenue due to low pricing, limiting profitability.

**Justification:**
Focusing solely on high-download categories without considering revenue may lead to missed monetization opportunities, reducing overall business growth.

**Actionable Risk:** Ignoring these insights can result in suboptimal investment and slower revenue growth, even if the apps are popular in terms of installs.

#### Chart - 13

In [None]:
# Chart - 13 visualization code (Sentiment Polarity vs App Rating (Sentiment–Rating Gap))
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd


# Step 1: Prepare data

# Calculate average sentiment per app
avg_sentiment = reviews_df.groupby('App')['Sentiment_Polarity'].mean().reset_index()

# Merge with app ratings
sentiment_rating_df = pd.merge(
    apps_df[['App', 'Rating', 'Category']],
    avg_sentiment,
    on='App',
    how='inner'
)

# Drop missing values
sentiment_rating_df.dropna(subset=['Rating', 'Sentiment_Polarity'], inplace=True)

# Step 2: Plot
plt.figure(figsize=(12, 7))
sns.scatterplot(
    data=sentiment_rating_df,
    x='Sentiment_Polarity',
    y='Rating',
    hue='Category',
    palette='tab10',
    alpha=0.7,
    s=100
)

# Reference lines
plt.axhline(sentiment_rating_df['Rating'].mean(), color='gray', linestyle='--', label='Avg Rating')
plt.axvline(0, color='red', linestyle='--', label='Neutral Sentiment')

# Labels and title
plt.title('Sentiment Polarity vs App Rating (Sentiment–Rating Gap)', fontsize=16, fontweight='bold')
plt.xlabel('Average Sentiment Polarity (-1 to 1)', fontsize=12)
plt.ylabel('App Rating', fontsize=12)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I picked this Sentiment Polarity vs App Rating scatter plot because it directly helps identify the Sentiment-to-Rating gap, which is the core objective of the project. This chart makes it easy to see whether user emotions expressed in reviews align with the star ratings shown on the Play Store. By plotting sentiment against ratings and adding reference lines, we can quickly spot apps where positive sentiment does not translate into high ratings or where negative sentiment is hidden behind acceptable ratings. This makes the chart highly effective for diagnosing underlying issues such as bugs, ads, or pricing concerns.

##### 2. What is/are the insight(s) found from the chart?

**1.Sentiment–Rating mismatch exists:** Several apps show positive sentiment but comparatively lower ratings, indicating that users like the app overall but are dissatisfied with specific issues such as bugs, excessive ads, or pricing.

**2.Hidden risk apps:** Some apps have acceptable ratings but negative sentiment, suggesting that current ratings may decline in the future if user concerns are not addressed.

**3.Strong alignment apps:** Apps where both sentiment polarity and ratings are high represent well-optimized apps with good performance and user satisfaction.

**4.Category-wise variation:** Certain categories display a wider sentiment–rating gap, highlighting that technical stability and monetization strategies impact user perception differently across categories.
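
The two reference lines split the plot into four regions, which map directly onto the groups described above. The sketch below assigns each app to a region using the same cut-offs as the chart (sentiment above zero, rating at or above the mean). The function name `label_quadrants`, the label wording, and the toy data are assumptions.

```python
import pandas as pd

def label_quadrants(df: pd.DataFrame) -> pd.DataFrame:
    """Label each app by its position relative to the two reference lines."""
    out = df.copy()
    rating_cut = out["Rating"].mean()            # horizontal reference line
    positive = out["Sentiment_Polarity"] > 0     # vertical reference line at 0
    high = out["Rating"] >= rating_cut

    out["quadrant"] = "negative sentiment, low rating"
    out.loc[positive & high, "quadrant"] = "aligned: positive sentiment, high rating"
    out.loc[positive & ~high, "quadrant"] = "gap: positive sentiment, low rating"
    out.loc[~positive & high, "quadrant"] = "hidden risk: negative sentiment, high rating"
    return out

# Hypothetical apps, shaped like sentiment_rating_df.
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Sentiment_Polarity": [0.30, 0.25, -0.15, -0.05],
    "Rating": [4.5, 3.4, 3.1, 4.4],
})
labeled = label_quadrants(toy)
quadrants = dict(zip(labeled["App"], labeled["quadrant"]))
print(labeled[["App", "quadrant"]])
```

Counting the labels per `Category` (for example with `value_counts`) would give a numeric version of the category-wise variation noted in point 4.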

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

***Positive Business Impact***

**1.Actionable problem detection:** The chart helps identify apps with positive sentiment but low ratings, allowing businesses to focus on fixing specific issues like bugs, intrusive ads, or confusing pricing. Resolving these issues can directly improve ratings and Play Store rankings.

**2.Better retention and growth:** Apps with aligned high sentiment and high ratings can be treated as best-practice models, helping teams replicate successful features and UX decisions across other apps.

**3.Data-driven prioritization:** By spotting sentiment–rating gaps early, businesses can prioritize technical fixes and monetization changes before ratings drop and user churn increases.

***Insights Leading to Negative Growth***

**Negative sentiment despite decent ratings:** Apps that show negative sentiment but still maintain average ratings represent a hidden risk. If ignored, these apps are likely to receive poorer future ratings.

**Justification:** Negative reviews often mention issues like bugs, performance problems, or aggressive ads. Over time, this leads to lower ratings, reduced visibility in the Play Store, fewer installs, and revenue decline.

**Strategic risk:** Failing to act on sentiment signals can cause even popular apps to lose user trust and market position, resulting in negative long-term growth.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# ----------------------
# Step 1: Select numeric columns
# ----------------------
numeric_cols = ['Rating', 'Reviews', 'Installs', 'Price', 'Size']

# Ensure columns exist
numeric_cols = [col for col in numeric_cols if col in apps_df.columns]

# ----------------------
# Step 2: Convert columns to numeric safely
# ----------------------
for col in numeric_cols:
    apps_df[col] = pd.to_numeric(apps_df[col], errors='coerce')

# Drop rows with missing values
corr_df = apps_df[numeric_cols].dropna()

# ----------------------
# Step 3: Compute correlation matrix
# ----------------------
correlation_matrix = corr_df.corr()

# ----------------------
# Step 4: Plot heatmap
# ----------------------
plt.figure(figsize=(10, 6))
sns.heatmap(
    correlation_matrix,
    annot=True,
    fmt=".2f",
    cmap='coolwarm',
    linewidths=0.5
)

plt.title('Correlation Heatmap of Key App Features', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I picked a correlation heatmap because it provides a quick and clear view of how key numerical variables such as Rating, Reviews, Installs, Price, and Size are related to each other. Instead of analyzing each relationship separately, the heatmap summarizes all correlations in one visual. This makes it easier to identify strong, weak, or no relationships between variables, which is essential for understanding the main factors that influence app popularity, user satisfaction, and monetization.

##### 2. What is/are the insight(s) found from the chart?

**1.Strong positive correlation between Reviews and Installs:** Apps with more installs tend to receive more reviews, indicating that user engagement grows as app reach increases.

**2.Weak correlation between Rating and Installs:** Highly installed apps do not always have the highest ratings, showing that popularity does not guarantee quality.

**3.Price shows very low correlation with Rating:** Paid apps are not necessarily rated higher, suggesting that users value app quality and experience more than price.

**4.Size has minimal impact on Rating:** App size does not significantly affect user satisfaction as long as the app performs well.
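
The heatmap code relies on `pd.to_numeric(..., errors='coerce')`, which turns any non-numeric text into NaN. If `Size` still holds raw strings at this point (an assumption; it may already have been cleaned in an earlier step), every such value would be coerced to NaN and those rows dropped before the correlation is computed. A minimal parsing sketch, assuming sizes look like `'19M'`, `'512k'`, or `'Varies with device'`; the helper name `size_to_mb` is hypothetical:

```python
import math

def size_to_mb(raw) -> float:
    """Convert a size string to megabytes; return NaN when it cannot be parsed."""
    if not isinstance(raw, str):
        # Already numeric (or missing): pass through as a float.
        return float(raw) if raw is not None else math.nan
    text = raw.strip().replace(",", "")
    units = {"M": 1.0, "k": 1.0 / 1024}  # assumes 1 MB = 1024 kB
    try:
        if text and text[-1] in units:
            return float(text[:-1]) * units[text[-1]]
        return float(text)
    except ValueError:
        # Free-text values such as 'Varies with device' end up here.
        return math.nan

print(size_to_mb("19M"), size_to_mb("512k"), size_to_mb("Varies with device"))
```

Applying a helper like this to `apps_df['Size']` before the numeric conversion would keep sized apps in `corr_df`, so the Size row of the heatmap is computed from real values.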

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np


# Step 1: Select numeric columns
pair_cols = ['Rating', 'Reviews', 'Installs', 'Price', 'Size']

# Keep only existing columns
pair_cols = [col for col in pair_cols if col in apps_df.columns]

# Step 2: Convert to numeric safely
for col in pair_cols:
    apps_df[col] = pd.to_numeric(apps_df[col], errors='coerce')

# Drop missing values
pair_df = apps_df[pair_cols].dropna()

# Optional: sample data if dataset is large (avoids slow plotting)
pair_df = pair_df.sample(min(1000, len(pair_df)), random_state=42)


# Step 3: Create Pair Plot
sns.pairplot(
    pair_df,
    diag_kind='kde',
    corner=True
)

plt.suptitle('Pair Plot of Key App Features', y=1.02, fontsize=16, fontweight='bold')
plt.show()


##### 1. Why did you pick the specific chart?

I picked the pair plot because it allows us to analyze multiple numerical variables together in one view rather than looking at them separately.

This chart is especially useful in EDA because it:

Shows the relationship between key app metrics like Rating, Reviews, Installs, Price, and Size at the same time.

Helps quickly identify patterns, trends, and correlations between variables that may influence app performance.

Displays both distributions (diagonal plots) and pairwise relationships (scatter plots), making it easier to spot outliers or unusual behavior.

##### 2. What is/are the insight(s) found from the chart?

**1. Ratings vs Reviews**
Apps with more reviews generally show more stable and slightly higher ratings. Apps with very few reviews have widely spread ratings, which suggests that ratings for new or less-used apps are less reliable.

**2. Installs vs Reviews**
There is a strong positive relationship between installs and reviews. As the number of installs increases, the number of reviews also increases, which makes sense because more users lead to more feedback.

**3. Price vs Installs**
Paid apps usually have fewer installs than free apps. Higher prices are associated with lower install counts, which suggests that users are sensitive to pricing.

**4. Size vs Installs**
App size does not show a strong direct relationship with installs. Both small and large apps can achieve high installs, suggesting that size alone is not a deciding factor for users.

**5. Outliers**
A few apps stand out with extremely high installs or reviews. These are likely popular or well-established apps, and they can heavily influence averages.
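
The visual impressions above can be cross-checked numerically. The following is a minimal sketch, assuming the same column names as the pair plot, that computes Spearman rank correlations; rank-based correlation is used here because it is less affected by the extreme outliers noted in insight 5. The function name `rank_correlations` and the toy frame are hypothetical and only illustrate the call.

```python
import pandas as pd

def rank_correlations(df, cols):
    """Spearman rank correlation between whichever of `cols` exist in df."""
    present = [c for c in cols if c in df.columns]
    # Coerce to numeric and drop incomplete rows, mirroring the pair plot cell
    numeric = df[present].apply(pd.to_numeric, errors="coerce").dropna()
    return numeric.corr(method="spearman")

# Toy frame for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "Rating": [4.1, 3.8, 4.5, 4.0, 4.3],
    "Reviews": [120, 15, 98000, 2300, 41000],
    "Installs": [10000, 1000, 5000000, 100000, 1000000],
    "Price": [0.0, 2.99, 0.0, 0.0, 0.99],
})
corr = rank_correlations(toy, ["Rating", "Reviews", "Installs", "Price", "Size"])
print(corr.round(2))
```

On the real data, `rank_correlations(apps_df, pair_cols)` would give one coefficient per pair of features, which makes it easier to state how strong each relationship actually is.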

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the Business Objective?
Explain briefly.

**1. Reduce the Sentiment-to-Rating Gap**
Negative sentiment often comes from ads, bugs, and pricing issues. Actively monitor user reviews using sentiment analysis and prioritize fixes for frequently mentioned complaints. This will help convert negative feedback into better ratings.

**2. Improve App Stability and Bug Fix Cycles**
Apps that are updated regularly tend to have better ratings. Release frequent updates focused on bug fixes and performance improvements, and clearly mention these fixes in update notes so users feel heard.

**3. Optimize Ad Strategy**
Excessive or intrusive ads are a major cause of dissatisfaction. Use fewer but better-placed ads, and offer ad-free or low-ad premium versions. This improves user experience while protecting revenue.

**4. Review Pricing and Monetization**
Users are price-sensitive. Keep pricing transparent and competitive. Free trials, discounts, or freemium models can increase installs while still generating revenue.

**5. Leverage Reviews and Engagement**
Encourage satisfied users to leave reviews after positive in-app experiences. More reviews improve credibility and reduce rating volatility.

**6. Focus on High-Impact Categories**
Invest more in categories with high installs and engagement but poor sentiment. Improving these areas can quickly boost rankings and visibility.
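
One possible way to make the sentiment-to-rating gap from recommendation 1 measurable is sketched below. It assumes both tables share an `App` column, and it uses a hypothetical linear rescaling of `Sentiment_Polarity` (from -1..1) onto the 1..5 star scale; neither the rescaling nor the function name `sentiment_rating_gap` comes from the analysis above, and the toy rows are made up for illustration.

```python
import pandas as pd

def sentiment_rating_gap(apps, reviews):
    """Per-app difference between star rating and sentiment mapped to stars."""
    # Average review polarity per app, ignoring reviews without a score
    mean_polarity = (
        reviews.dropna(subset=["Sentiment_Polarity"])
        .groupby("App", as_index=False)["Sentiment_Polarity"]
        .mean()
    )
    merged = (
        apps[["App", "Rating"]]
        .dropna()
        .drop_duplicates("App")
        .merge(mean_polarity, on="App", how="inner")
    )
    # Assumed rescaling: polarity -1 -> 1 star, 0 -> 3 stars, +1 -> 5 stars
    merged["Sentiment_Stars"] = 3 + 2 * merged["Sentiment_Polarity"]
    # Positive gap: the rating looks better than the review text suggests
    merged["Gap"] = merged["Rating"] - merged["Sentiment_Stars"]
    return merged.sort_values("Gap", ascending=False).reset_index(drop=True)

# Toy rows for illustration only (not taken from the dataset)
toy_apps = pd.DataFrame({"App": ["A", "A", "B"], "Rating": [4.5, 4.5, 3.0]})
toy_reviews = pd.DataFrame({
    "App": ["A", "A", "B", "B", "B"],
    "Sentiment_Polarity": [-0.5, -0.5, 0.0, 0.5, None],
})
gap = sentiment_rating_gap(toy_apps, toy_reviews)
print(gap)
```

Apps at the top of this table have ratings that are higher than their review text would suggest, so they may be candidates for a closer look at recent complaints before the rating catches up.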

# **Conclusion**

This exploratory data analysis of Google Play Store apps highlights a clear relationship between user sentiment, app ratings, and overall app performance. The analysis suggests that frequent bugs, intrusive advertisements, and pricing concerns are the main reasons behind negative user sentiment and lower ratings. Apps that receive regular updates and address user feedback tend to maintain higher ratings and better user satisfaction.

The insights gained from visualizations and sentiment analysis indicate that narrowing the gap between user sentiment and ratings can improve Play Store rankings. Aggressive monetization and poor maintenance are associated with negative sentiment and weaker growth, whereas a focus on app quality, transparent pricing, and balanced ad strategies creates a positive business impact.

Overall, this study indicates that data-driven decisions based on user feedback can help developers enhance app performance, improve user experience, and achieve sustainable growth in a competitive app marketplace.
