# **Project Name**    -  Play Store App Review Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

This exploratory data analysis examines Google Play Store app metadata alongside user reviews to uncover descriptive patterns and relationships associated with app engagement and success, strictly through data profiling, cleaning, and visualization-driven insights. The analysis combines app-level attributes (category, rating, reviews, installs, price, size, content rating, genres, last updated, Android version) with review text and sentiment signals to characterize distributions, data quality, and inter-variable associations that shape market structure and user-perceived quality.

The pipeline begins with rigorous data quality checks and standardization. Key steps include parsing and normalizing installs (removing “+” and commas, casting to integers), prices (stripping currency symbols), and sizes (harmonizing MB/KB and “Varies with device”), plus de-duplicating app entries and addressing missing ratings. Outlier detection flags extreme values in reviews, installs, and size, while date parsing enables recency features (days since last update). Separate profiling of the reviews dataset inspects nulls, language or encoding anomalies, and class balance in sentiment labels if present.

Univariate exploration reveals skewed distributions across engagement and popularity metrics. Installs and reviews exhibit heavy right-skew with long tails, necessitating log-scale perspectives for interpretability. Ratings cluster between 3.5 and 4.7 with thinner mass at extremes, and missing ratings are non-randomly distributed across smaller or newer apps. Size shows multimodality due to games and media-heavy categories. Category-level counts confirm the dominance of Games, Productivity, Tools, Communication, and Education in app supply, with variability in median rating and typical size by category.

Bivariate analysis characterizes core relationships. Reviews and installs show a strong monotonic association on log scales, consistent with social proof dynamics; scatterplots and correlation matrices confirm this relationship while highlighting heteroskedasticity at the high-install range. Price relates inversely to installs, with free/freemium clusters significantly outnumbering paid. App size shows weak to moderate association with lower rating in some categories, likely reflecting performance and resource constraints; however, this association attenuates when controlling for category. Recency of update correlates positively with rating and review volume in many segments, visible in grouped boxplots by update age buckets.

Segmented EDA uncovers category-specific nuances. Games and Entertainment have higher variance in both rating and size; Productivity and Tools show more compact rating distributions with mid-to-high medians; Finance and Health often have higher recency of updates and tighter size budgets. Content rating segmentation (Everyone, Teen, Mature) indicates differing install patterns and rating spreads, suggesting audience-driven expectations around ads, UX, and stability. For paid apps, median installs are markedly lower, but median rating can be stable or slightly higher in niche categories, indicating value perception effects.

Reviews EDA focuses on high-level sentiment and themes without modeling. Summary statistics of polarity or basic lexicons reveal that negative feedback frequently clusters around crashes, latency, battery usage, intrusive ads, and gated features; positive feedback highlights ease of use, clean UI, and reliable performance. Time-sliced sentiment distributions show sensitivity to updates and release cycles, and apps with consistent recent positive reviews tend to maintain higher contemporaneous ratings. Word-frequency and bigram exploration identifies recurrent topics per category, linking common pain points (e.g., ads in casual games, login issues in finance, sync reliability in productivity) to observed rating dispersion.

Together, the EDA establishes a descriptive baseline: the Play Store follows a power-law landscape with concentrated success, ratings are moderately high but sensitive to operational quality, and engagement proxies (reviews and installs) co-move strongly while being moderated by price, category, size, and update recency. These findings, derived purely from profiling and visualization, frame hypotheses for subsequent modeling and guide where deeper causal or predictive analyses would be most informative. If desired, this EDA can be extended with robust outlier treatment, seasonality-aware trend charts for reviews, and stratified comparisons across app age cohorts to further refine descriptive insights.





# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Write Problem Statement Here.**

#### **Define Your Business Objective?**

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps. Explore and analyse the data to discover key factors responsible for app engagement and success.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
# Load Dataset
data1=pd.read_csv(r"/content/Play Store Data.csv")
data2=pd.read_csv(r"/content/User Reviews.csv")

### Dataset First View

In [None]:
data2.head()

In [None]:
data1.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(data2.shape)
print(data1.shape)

In [None]:
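# Merge app metadata with user reviews on the app name
# (an inner join keeps only apps present in both datasets)
df = pd.merge(data1, data2, on='App', how='inner')


In [None]:
# Shape of the merged dataset
df.shape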
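
Because `data2` holds many reviews per app, the inner merge produces one row per review rather than one row per app, and apps without reviews are dropped. A minimal sketch of how this could be checked is shown below. It uses two small made-up frames as stand-ins for `data1` and `data2` (the values are hypothetical), and pandas' `indicator=True` option to label where each row came from.

```python
import pandas as pd

# Toy stand-ins for data1 (apps) and data2 (reviews); values are made up for illustration.
apps = pd.DataFrame({"App": ["A", "B", "C"],
                     "Category": ["GAME", "TOOLS", "FAMILY"]})
reviews = pd.DataFrame({"App": ["A", "A", "B", "D"],
                        "Sentiment_Polarity": [0.5, -0.2, 0.1, 0.3]})

# An outer merge with indicator=True labels each row as both / left_only / right_only.
check = pd.merge(apps, reviews, on="App", how="outer", indicator=True)
match_counts = check["_merge"].value_counts()
print(match_counts)

# The inner merge keeps only matched apps and has one row per review.
merged = pd.merge(apps, reviews, on="App", how="inner")
print(merged.shape, merged["App"].nunique())
```

In the notebook, comparing `df.shape[0]` with `df['App'].nunique()` would show how many review rows each app contributes on average.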

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Percentage of fully duplicated rows
df.duplicated().mean()*100

#### Missing Values/Null Values

In [None]:
# Percentage of missing values per column
df.isnull().mean()*100

In [None]:
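# Visualizing the missing values
# Print the per-column counts explicitly; a bare expression is only shown when it is the last line of a cell
print(df.isna().sum().sort_values(ascending=False))
sns.heatmap(df.isna(), cbar=False)
plt.show()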

### What did you know about your dataset?


*   About 40% of the values in this data are missing.
*   Most columns have the `object` datatype, so they have to be converted to their proper datatypes before analysis.
*   The "Reviews", "Size", "Installs", "Price", "Current Ver" and "Android Ver" columns have to be converted to numeric.
*   The "Last Updated" column has to be converted from object to datetime.






In [None]:
# Dropping rows with missing values
df=df.dropna()
df.shape

In [None]:
df.head(5)

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

In [None]:
# histplot to see sentiment_polarity
sns.histplot(x="Sentiment_Polarity", data=data2, kde=True)
plt.show()

In [None]:
# histplot to see sentiment_subjectivity
sns.histplot(x="Sentiment_Subjectivity", data=data2, kde=True)
plt.show()

### Variables Description

In [None]:
# Describing all variables, including object columns
df.describe(include=object)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#change review dtype to numeric

df['Reviews']=df['Reviews'].astype('int64')
# df['Reviews'].value_counts()

In [None]:
#change install dtype to int64
df.loc[:,'Installs']=df['Installs'].str.replace(r'[^\d.-]+','',regex=True)
df['Installs']=df.loc[:,'Installs'].astype('int64')
df.loc[:,'Installs'].dtype

In [None]:
# change price to float
d = df['Price'].astype(str).str.replace("$", "", regex=False).str.replace(",", "", regex=False).str.strip()

df['Price'] = pd.to_numeric(d, errors='coerce').astype('float64')

In [None]:
# change last updated to datetime datatype
df['Last Updated']=pd.to_datetime(df['Last Updated'], format="%B %d, %Y")
df.loc[:,'Last Updated'].dtype
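
Once `Last Updated` is a datetime column, the recency feature mentioned in the project summary (days since last update) can be derived from it. The sketch below is one possible way to do that on a small made-up frame. The column name `Days_Since_Update` and the choice of reference date are assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data in the same "Month day, Year" format as the 'Last Updated' column (values are made up).
toy = pd.DataFrame({"App": ["A", "B"],
                    "Last Updated": ["January 7, 2018", "August 1, 2018"]})
toy["Last Updated"] = pd.to_datetime(toy["Last Updated"], format="%B %d, %Y")

# Assumption: recency is measured relative to the newest update date in the data,
# since the snapshot's collection date is not stated in the notebook.
reference_date = toy["Last Updated"].max()
toy["Days_Since_Update"] = (reference_date - toy["Last Updated"]).dt.days
print(toy)
```

The resulting column could then be bucketed (for example with `pd.cut`) to compare ratings across update-age groups.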

In [None]:
df.info()

In [None]:
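# Convert size strings such as "19M" or "201k" to a number of bytes
def convert_str(i):
    if isinstance(i, str) and 'm' in i.lower():
        return float(i.lower().replace('m', '')) * 1e6
    elif isinstance(i, str) and 'k' in i.lower():
        return float(i.lower().replace('k', '')) * 1e3
    else:
        try:
            return float(i)
        except (TypeError, ValueError):
            # Non-numeric entries such as "Varies with device" become missing
            return None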

# Apply conversion first
df['Size'] = df['Size'].apply(convert_str)

# Then compute median and fill NaNs
median_size = df['Size'].median()
df['Size'] = df['Size'].fillna(median_size)




In [None]:
df.info()

In [None]:
df.head()

### What all manipulations have you done and insights you found?

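The two datasets were merged on `App`, and rows with missing values were dropped. `Reviews` and `Installs` were converted to integers (after stripping "+" and commas from `Installs`), `Price` to float (after stripping "$"), and `Last Updated` to datetime. `Size` was converted to a number of bytes, and non-numeric entries such as "Varies with device" were replaced by the median size.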

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Histplot

sns.histplot(data=df,x='Rating',bins=20,kde=True,color='red')
plt.xticks(rotation=45)
plt.title('Rating Distribution')
plt.show()



##### 1. Why did you pick the specific chart?

A histplot is well suited to showing the distribution of a numeric variable.


##### 2. What is/are the insight(s) found from the chart?

Roughly 80% of the ratings are greater than 4.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This suggests that an app with a rating above 3.5 can be called successful.
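
A share such as "about 80% above 4" is easier to trust when it is computed rather than read off the histogram. The helper below, `share_above`, is a hypothetical name introduced for illustration, and the ratings are made up. Note also that the merged `df` has one row per review, so in the notebook it may be fairer to pass app-level ratings, for example `df.drop_duplicates('App')['Rating']`.

```python
import pandas as pd

def share_above(ratings: pd.Series, threshold: float) -> float:
    """Return the percentage of non-null ratings strictly above `threshold`."""
    valid = ratings.dropna()
    return float((valid > threshold).mean() * 100)

# Toy ratings, made up for illustration only.
toy_ratings = pd.Series([4.5, 4.2, 3.9, 4.7, 3.0, None, 4.1, 4.4, 2.5, 4.8])
print(share_above(toy_ratings, 4.0))
print(share_above(toy_ratings, 3.5))
```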



#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Sentiment_Polarity
sns.kdeplot(data=df,x='Sentiment_Polarity',fill=True)
plt.xticks(rotation=45)
plt.title('Sentiment_Polarity distribution')
plt.show()

##### 1. Why did you pick the specific chart?

A KDE plot is well suited to showing the distribution of a continuous column.

##### 2. What is/are the insight(s) found from the chart?

Most sentiment polarity values fall in the range 0.00 to 0.25, which indicates mildly positive sentiment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

#### Chart - 3

In [None]:
# Countplot
# To see top categories by app count
plt.figure(figsize=(14,8))
order = df['Category'].value_counts().head(15).index
sns.countplot(data=df, y='Category', order=order)
plt.title('Top 15 Categories by App Count')
plt.show()


##### 1. Why did you pick the specific chart?

We chose a countplot to get the number of apps in each category.

##### 2. What is/are the insight(s) found from the chart?

The Game category has the highest app count.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This shows that competition is very high among gaming apps.
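
One caveat: the countplot is drawn on the merged `df`, which has one row per review, so the bars count review rows rather than distinct apps. A minimal sketch of the difference, on a made-up frame, is shown below. Counting unique `App` values per category is one way to get a true app count.

```python
import pandas as pd

# Toy merged frame: one row per review, so app "A" appears three times (made-up values).
toy = pd.DataFrame({
    "App": ["A", "A", "A", "B", "C"],
    "Category": ["GAME", "GAME", "GAME", "TOOLS", "TOOLS"],
})

# Row counts are inflated by apps with many reviews...
rows_per_category = toy["Category"].value_counts()
# ...while nunique() counts each app once.
apps_per_category = (toy.groupby("Category")["App"]
                        .nunique()
                        .sort_values(ascending=False))
print(rows_per_category)
print(apps_per_category)
```

In this toy example GAME has the most rows but TOOLS has the most apps, so the two views can rank categories differently.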


#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Histplot
# To see install distribution
df['Installs_log'] = np.log10(df['Installs'] + 1)  # After cleaning
sns.histplot(data=df,x='Installs_log',bins=20,kde=True)
plt.xticks(rotation=45)
plt.title('Install Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

A histplot is used here to see the distribution of installs.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that a small number of apps have high installs, while a large number of apps have low installs.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This helps identify which apps have higher installs.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Boxplot and histplot
# To see the outliers in reviews
fig, axes = plt.subplots(1,2, figsize=(15,6))
sns.boxplot(data=df, y='Reviews', ax=axes[0])
sns.histplot(data=df, x='Reviews', bins=50, ax=axes[1])
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot shows the outliers and a histplot shows the distribution.

##### 2. What is/are the insight(s) found from the chart?

Most apps fall in the lowest range of review counts (0 to 1 on the chart's axis), while a small number of apps receive a very high number of reviews.

These high-review apps appear as outliers in the boxplot.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This insight helps identify apps with many reviews, which can serve as inspiration for building a successful app.
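
The outliers seen in the boxplot can also be listed explicitly. The sketch below applies the usual 1.5 × IQR whisker rule to a made-up series. `iqr_outlier_mask` is a hypothetical helper name, and the rule itself is an assumption about what counts as an outlier, not something the notebook defines.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (the usual boxplot whisker rule)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy review counts, made up for illustration only.
toy_reviews = pd.Series([10, 12, 15, 20, 25, 30, 1_000_000])
mask = iqr_outlier_mask(toy_reviews)
print(toy_reviews[mask])
```

In the notebook, `df[iqr_outlier_mask(df['Reviews'])]` would return the rows behind the points drawn beyond the whiskers.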

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Barplot
# To see the most installed categories by comparing average installs per category
order = df.groupby('Category')['Installs'].mean().sort_values(ascending=False).index

plt.figure(figsize=(15, 6))
sns.barplot(data=df,x='Category',y='Installs',color='green',order=order)
plt.xticks(rotation=90)
plt.title('category and installs')
plt.show()

##### 1. Why did you pick the specific chart?

A barplot makes it easy to compare installs across categories.

##### 2. What is/are the insight(s) found from the chart?

The most installed categories are **Communication** and **Photography**.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Building an app in the **Communication** or **Photography** category increases the chance of success.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
# Barplot
# Highest rated categories
plt.figure(figsize=(12,8))
order = df.groupby('Category')['Rating'].mean().sort_values(ascending=False).index
sns.barplot(data=df, x='Rating', y='Category', order=order[:15])
plt.title('Top 15 Categories by Average Rating')
plt.show()


##### 1. Why did you pick the specific chart?

A barplot makes it easy to compare average ratings across categories.

##### 2. What is/are the insight(s) found from the chart?

**Auto and Vehicles** is the highest-rated category.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Knowing which categories are rated highest helps in making a positive business impact.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
# Countplot
# Free vs paid apps by category
plt.figure(figsize=(14,8))
sns.countplot(data=df, y='Category', hue='Type',
              order=df['Category'].value_counts().head(15).index)
plt.title('Free vs Paid Apps by Category')
plt.show()


##### 1. Why did you pick the specific chart?

To compare the counts of free and paid apps.

##### 2. What is/are the insight(s) found from the chart?

There are very few paid apps here; most of the apps are free.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Free apps are far more numerous, so the chance of success is higher with a free app.


#### Chart - 9

In [None]:
# Chart - 9 visualization code
# Scatterplot
# Rating vs Installs by category
plt.figure(figsize=(13,8))
sns.scatterplot(data=df, x='Rating', y='Installs_log',
                hue='Category', alpha=0.6, palette='tab10')
plt.title('Rating vs Installs by Category')
plt.show()


##### 1. Why did you pick the specific chart?

A scatterplot shows the correlation between rating and installs.

##### 2. What is/are the insight(s) found from the chart?

We see a positive correlation between the two.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This helps in finding apps with a high rating but low installs; such a niche has few competitors but high growth potential.
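
The positive association read from the scatterplot can be quantified. Since installs are heavily skewed, a rank-based (Spearman) correlation is a reasonable choice. The sketch below computes it as the Pearson correlation of the ranks, on made-up data. The helper name `spearman` is hypothetical.

```python
import pandas as pd

def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman rank correlation, computed as the Pearson correlation of the ranks."""
    pair = pd.concat([x, y], axis=1).dropna()
    return float(pair.iloc[:, 0].rank().corr(pair.iloc[:, 1].rank()))

# Toy values, made up for illustration only.
toy = pd.DataFrame({"Rating": [3.0, 3.5, 4.0, 4.5, 5.0],
                    "Installs": [100, 1_000, 500, 100_000, 1_000_000]})
rho = spearman(toy["Rating"], toy["Installs"])
print(round(rho, 2))
```

A value near +1 indicates that higher-rated apps tend to have more installs; a value near 0 would suggest the visual impression is weak.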

#### Chart - 10

In [None]:
# Chart - 10 visualization code
# Boxplot
# To see outliers in sentiment by category
plt.figure(figsize=(12,8))
order = df.groupby('Category')['Sentiment_Polarity'].mean().sort_values(ascending=False).index
sns.boxplot(data=df, x='Sentiment_Polarity', y='Category', order=order[:15])
plt.title('Review Sentiment by Category')
plt.show()


##### 1. Why did you pick the specific chart?

A boxplot shows the outliers in sentiment polarity for each category.

##### 2. What is/are the insight(s) found from the chart?

There are many outliers, and the sentiment polarity of most apps lies between 0.00 and 0.50.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The outliers point to what is good and bad about apps, which helps in building a good app.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
# Rating vs Installs, split by Category (top 6 only)
top_cats = df['Category'].value_counts().head(6).index
# Set hue on the grid so that 'Type' gets the same colors in every facet
g = sns.FacetGrid(df[df['Category'].isin(top_cats)],
                  col='Category', col_wrap=3, height=4, hue='Type')
g.map(sns.scatterplot, 'Rating', 'Installs')
g.add_legend()
plt.show()


##### 1. Why did you pick the specific chart?

A FacetGrid of scatterplots shows the correlation within each category separately.

##### 2. What is/are the insight(s) found from the chart?

Game, Family and Productivity show a good correlation, meaning that higher ratings go with more installs.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

This helps in making a positive business impact.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
# How Content Rating affects Category performance
pivot_table = df.pivot_table(values='Rating',
                                  index='Category',
                                  columns='Content Rating',
                                  aggfunc='mean')
# Pass figsize to .plot(), which creates its own figure (a separate plt.figure() would stay empty)
pivot_table.plot(kind='bar', stacked=False, figsize=(25,10))
plt.xticks(rotation=90)
plt.title('Average Rating by Category and Content Rating')
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Violinplot
# Install distribution: free vs paid by category
plt.figure(figsize=(14,10))
sns.violinplot(data=df, x='Type', y='Installs',
               hue='Category', split=True, inner='quart')
plt.title('Installs Distribution: Free vs Paid by Category')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()


##### 1. Why did you pick the specific chart?

Violin plots are good for distribution analysis: the width and skewness of each violin describe the distribution.

##### 2. What is/are the insight(s) found from the chart?

Most free apps are concentrated between -0.1 and 0.2 on the installs axis, while paid apps have very few points.
This suggests that free apps get more installs than paid apps.
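
The free versus paid comparison can be backed by a simple grouped summary. The sketch below uses made-up app-level data with the notebook's `Type` and `Installs` column names. The median is used because a few very large apps dominate the mean.

```python
import pandas as pd

# Toy app-level data (made-up values); 'Type' and 'Installs' mirror the notebook's column names.
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Free", "Paid", "Paid"],
    "Installs": [1_000_000, 50_000, 10_000_000, 1_000, 10_000],
})

# Count and median installs per app type.
summary = toy.groupby("Type")["Installs"].agg(["count", "median"])
print(summary)
```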

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Full correlation matrix
plt.figure(figsize=(12,10))
numeric_cols = ['Rating', 'Installs', 'Reviews', 'Sentiment_Polarity',
                'Sentiment_Subjectivity', 'Price']
corr_matrix = df[numeric_cols].corr()
sns.heatmap(corr_matrix, annot=True, center=0, cmap='coolwarm')
plt.title('Feature Correlation Matrix')
plt.show()


##### 1. Why did you pick the specific chart?

A heatmap is the best plot for finding the correlation between columns.

##### 2. What is/are the insight(s) found from the chart?

Installs and reviews are the most correlated pair.
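
Because both installs and reviews are heavily right-skewed, a Pearson correlation on raw values can be driven by a handful of very large apps. One way to check the relationship is to compare it with the correlation on a log scale, as sketched below on made-up values. The `+ 1` offset mirrors the `Installs_log` column created for Chart 4.

```python
import numpy as np
import pandas as pd

# Toy values, made up for illustration only.
toy = pd.DataFrame({
    "Installs": [1_000, 10_000, 100_000, 1_000_000, 100_000_000],
    "Reviews": [20, 150, 2_000, 30_000, 900_000],
})

# Pearson correlation on raw values versus on log10-transformed values.
raw_corr = toy["Installs"].corr(toy["Reviews"])
log_corr = np.log10(toy["Installs"] + 1).corr(np.log10(toy["Reviews"] + 1))
print(round(raw_corr, 3), round(log_corr, 3))
```

If the two numbers differ a lot on the real data, the log-scale value is usually the better summary of how installs and reviews move together across typical apps.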

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Rating, Installs, Reviews, Sentiment correlations
sns.pairplot(df, vars=['Rating', 'Installs', 'Reviews', 'Sentiment_Polarity'],
             hue='Type', diag_kind='kde')
plt.show()


##### 1. Why did you pick the specific chart?

A pair plot shows the relationships between all numerical columns in one place, which makes comparison easier.

##### 2. What is/are the insight(s) found from the chart?

Sentiment polarity shows a very good correlation with the other variables.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.



*   To be successful, try to build an app in a category with a low app count.
*   Build free apps to get more installs.
*   Some categories have both high ratings and high installs; building an app in such a category can mean less competition.
*   Build an app in the Communication category, where the app count is low and installs are high.
*   Before building an app in any category, study the outlier reviews; they help in understanding what people want in their apps and what they dislike.







# **Conclusion**

In line with our business objective of analysing what makes apps successful on the Play Store, the insights above can guide the development of a good app.

Focusing on less competitive categories, such as Communication, improves the chance of success.


### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***