<a href="https://colab.research.google.com/github/ashwinivibhandik18/ATM-Machine_codechef/blob/main/Almabetter_Project2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Play Store App Review Analysis

##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

This capstone project, Play Store App Review Analysis, focuses on understanding the factors that influence app success on the Google Play Store. With thousands of apps competing for user attention, analyzing both app-level metadata and user reviews provides valuable insights for developers and businesses.

The project uses two datasets: the Play Store dataset (10,841 apps, 13 features such as category, rating, installs, size, type, price, and content rating) and the User Reviews dataset (64,295 reviews with sentiment information). The study applies data cleaning, preprocessing, exploratory data analysis (EDA), and sentiment analysis to identify key patterns and trends.

Findings show that Games, Family, and Tools are the most popular categories. Free apps dominate the market and attract far more installs than paid apps, which suggests that freemium models are effective. App size influences downloads, with excessively large apps discouraging users. The content rating “Everyone” draws the widest user base, highlighting the importance of accessibility.

Sentiment analysis revealed that most reviews are positive, though negative reviews frequently cite crashes, poor updates, or intrusive ads. These insights emphasize the need for regular updates, bug fixes, and user-focused design.

Overall, the project demonstrates how EDA combined with sentiment analysis can uncover the drivers of app popularity and user satisfaction, offering practical recommendations for improving app performance and user engagement.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Business Context

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps. Explore and analyse the data to discover key factors responsible for app engagement and success.

#### **Define Your Business Objective?**

The business objective of this project is to discover the key factors responsible for app engagement and success on the Google Play Store by analyzing app features and user reviews. Through exploratory data analysis and sentiment analysis, the study aims to provide actionable insights that help developers and businesses improve app quality, user satisfaction, and overall market performance.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the following questions.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


### Dataset First View

In [None]:
# Dataset First Look
playstore_data=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Play Store Data.csv")
playstore_data.head()


In [None]:
user_reviews_data=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/User Reviews.csv")
user_reviews_data.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(playstore_data.shape)
print(user_reviews_data.shape)

### Dataset Information

In [None]:
# Dataset Info
# info() prints its summary directly and returns None, so wrapping it in print() adds a stray "None"
playstore_data.info()
user_reviews_data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print(playstore_data.duplicated().sum())
print(user_reviews_data.duplicated().sum())



#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
print(playstore_data.isnull().sum())
print(user_reviews_data.isnull().sum())

In [None]:
missing_counts = playstore_data.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

missing_counts.plot(kind='bar', figsize=(10, 5), color='green')
plt.title('Number of Missing Values per Column')
plt.ylabel('Count')
plt.xlabel('Column')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.show()


In [None]:
missing_counts = user_reviews_data.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]

missing_counts.plot(kind='bar', figsize=(10, 5), color='green')
plt.title('Number of Missing Values per Column')
plt.ylabel('Count')
plt.xlabel('Column')
plt.xticks(rotation=0)
plt.grid(axis='y')
plt.show()


### What did you know about your dataset?




The project uses 2 datasets: Play Store data (10,841 rows, 13 columns) and User Reviews data (64,295 rows, 5 columns).

The Rating column has 1,474 missing values in the Play Store dataset.

The User Reviews dataset has about 26,000 missing values in review and sentiment columns.

There are 483 duplicate rows in Play Store data and 33,616 duplicate rows in User Reviews data.

Columns like Installs, Reviews, Price, and Size are stored as text and need conversion.

Most apps in the dataset are free apps.
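
The text-to-number conversion mentioned above can be sketched as follows. This is only an illustration on a few made-up rows: the sample values, the helper `size_to_mb`, and the `Size_MB` column name are assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample rows that mimic the text formats seen in the Play Store data
sample = pd.DataFrame({
    "Installs": ["10,000+", "5,000,000+", "Free"],
    "Price": ["0", "$4.99", "Everyone"],
    "Size": ["19M", "201k", "Varies with device"],
    "Reviews": ["159", "87510", "3.0M"],
})

def size_to_mb(value):
    """Convert '19M' or '201k' to megabytes; return NaN for anything else."""
    text = str(value).strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1024
    return float("nan")

cleaned = pd.DataFrame({
    # Strip '+' and ',' before converting; unparseable text becomes NaN
    "Installs": pd.to_numeric(
        sample["Installs"].str.replace("[+,]", "", regex=True), errors="coerce"),
    # Strip the dollar sign before converting
    "Price": pd.to_numeric(
        sample["Price"].str.replace("$", "", regex=False), errors="coerce"),
    "Size_MB": sample["Size"].map(size_to_mb),
    "Reviews": pd.to_numeric(sample["Reviews"], errors="coerce"),
})
print(cleaned)
```

Using `errors="coerce"` turns values that cannot be parsed into `NaN` instead of raising, which makes misaligned rows easy to spot afterwards.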

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(playstore_data.columns)
print(user_reviews_data.columns)

In [None]:
# Dataset Describe
print(playstore_data.describe())
print(user_reviews_data.describe())

### Variables Description

**Play Store App Dataset**

App – Name of the mobile application.

Category – Main category of the app (e.g., Game, Education, Business).

Rating – Average user rating of the app (mostly between 4.0 and 4.5).

Reviews – Total number of user reviews received by the app.

Size – Size of the app in MB or KB.

Installs – Number of times the app has been downloaded.

Type – Indicates whether the app is Free or Paid.

Price – Cost of the app (0 for free apps).

Content Rating – Age group the app is suitable for (e.g., Everyone, Teen, Mature 17+).

Genres – Sub-category or type of the app.

Last Updated – Date when the app was last updated.

Current Ver – Current version of the app.

Android Ver – Minimum Android version required to run the app.

**User Reviews Dataset**

App – Name of the app for which the review is written.

Translated_Review – User review text translated into English.

Sentiment – Sentiment of the review (Positive, Negative, Neutral).

Sentiment_Polarity – Sentiment score ranging from –1 (negative) to +1 (positive).

Sentiment_Subjectivity – Measures how opinion-based the review is (0 = factual, 1 = personal opinion).
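
A compact way to review the variables described above is a per-column overview. The sketch below uses a hypothetical helper `column_overview` and a tiny made-up frame; neither comes from the original notebook.

```python
import pandas as pd

def column_overview(df):
    """Return one row per column with its dtype, unique count and missing count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "unique": df.nunique(),          # NaN is not counted as a unique value
        "missing": df.isnull().sum(),
    })

# Hypothetical miniature frame standing in for the real dataset
demo = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, None, 4.5],
    "Type": ["Free", "Free", "Paid"],
})
overview = column_overview(demo)
print(overview)
```

The same helper could be called on `playstore_data` or `user_reviews_data` to get a single table instead of separate printouts.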


### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in playstore_data.columns:
    print(playstore_data[column].unique())


In [None]:
# Check Unique Values for each variable.
for column in user_reviews_data.columns:
    print(user_reviews_data[column].unique())


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

#Handling null values for playstore_data
print(playstore_data.isnull().sum())

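print(playstore_data['Rating'].dtype)
sns.histplot(x=playstore_data['Rating'])
plt.show()
# Skewness guides the choice between mean and median imputation
print(playstore_data['Rating'].skew())
# Assign the result back: fillna(..., inplace=True) on a selected column may not update the DataFrame in recent pandas
playstore_data['Rating'] = playstore_data['Rating'].fillna(playstore_data['Rating'].median())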



In [None]:
print(playstore_data['Type'].dtype)
print(playstore_data['Type'].unique())
playstore_data['Type'] = playstore_data['Type'].replace(['0', 0], playstore_data['Type'].mode()[0])
playstore_data['Type'] = playstore_data['Type'].fillna(playstore_data['Type'].mode()[0])
print(playstore_data['Type'].unique())

In [None]:
print(playstore_data['Content Rating'].dtype)
print(playstore_data['Content Rating'].unique())
playstore_data['Content Rating'] = playstore_data['Content Rating'].fillna(playstore_data['Content Rating'].mode()[0])
print(playstore_data['Content Rating'].unique())


In [None]:
print(playstore_data['Current Ver'].dtype)
print(playstore_data['Current Ver'].unique())
playstore_data['Current Ver'] = playstore_data['Current Ver'].fillna(playstore_data['Current Ver'].mode()[0])
print(playstore_data['Current Ver'].unique())


In [None]:
print(playstore_data['Android Ver'].dtype)
print(playstore_data['Android Ver'].unique())
playstore_data['Android Ver'] = playstore_data['Android Ver'].fillna(playstore_data['Android Ver'].mode()[0])
print(playstore_data['Android Ver'].unique())


In [None]:
playstore_data.isnull().sum()


In [None]:
#Handling null values for user_reviews_data
print(user_reviews_data.isnull().sum())

In [None]:
print(user_reviews_data['Translated_Review'].dtype)
print(user_reviews_data['Translated_Review'].unique())
user_reviews_data['Translated_Review'] = user_reviews_data['Translated_Review'].fillna(user_reviews_data['Translated_Review'].mode()[0])
print(user_reviews_data['Translated_Review'].unique())


In [None]:
print(user_reviews_data['Sentiment'].dtype)
print(user_reviews_data['Sentiment'].unique())
user_reviews_data['Sentiment'] = user_reviews_data['Sentiment'].fillna(user_reviews_data['Sentiment'].mode()[0])
print(user_reviews_data['Sentiment'].unique())

In [None]:
print(user_reviews_data['Sentiment_Polarity'].dtype)
print(user_reviews_data['Sentiment_Polarity'].unique())
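print(user_reviews_data['Sentiment_Polarity'].skew())
# Assign the result back instead of using inplace=True on a selected column
user_reviews_data['Sentiment_Polarity'] = user_reviews_data['Sentiment_Polarity'].fillna(user_reviews_data['Sentiment_Polarity'].median())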

In [None]:
print(user_reviews_data['Sentiment_Subjectivity'].dtype)
print(user_reviews_data['Sentiment_Subjectivity'].unique())
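print(user_reviews_data['Sentiment_Subjectivity'].skew())
# Assign the result back instead of using inplace=True on a selected column
user_reviews_data['Sentiment_Subjectivity'] = user_reviews_data['Sentiment_Subjectivity'].fillna(user_reviews_data['Sentiment_Subjectivity'].median())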

In [None]:
print(user_reviews_data.isnull().sum())

In [None]:
#Remove duplicate values
playstore_data=playstore_data.drop_duplicates()
print(playstore_data.duplicated().sum())
print(playstore_data.shape)

user_reviews_data=user_reviews_data.drop_duplicates()
print(user_reviews_data.duplicated().sum())
print(user_reviews_data.shape)


### What all manipulations have you done and insights you found?

**Play Store Data (playstore_data)**

Manipulation Done: Checked unique values for all columns.
Insight: Found missing values in Rating, Type, Content Rating, Current Ver, Android Ver. Also found anomalies like Category='1.9' and Rating=19.

Manipulation Done: Filled missing Rating values with median.
Insight: Ensures ratings are numeric and usable; median avoids bias from outliers.

Manipulation Done: Replaced Type '0' and NaN with the mode (Free).
Insight: All apps now correctly labeled as Free or Paid; no missing or invalid entries.

Manipulation Done: Filled missing Content Rating with mode (Everyone).
Insight: Categorical column complete; no gaps in audience rating.

Manipulation Done: Filled missing Current Ver and Android Ver with mode.
Insight: Version info complete; ready for version-based analysis.

Manipulation Done: Checked data types of all columns.
Insight: Numeric columns (Rating) confirmed as float; categorical columns as object.

Manipulation Done: Removed duplicates.
Insight: Dataset now has 10,358 rows with no exact duplicates (the same app can still appear more than once if any field differs).
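
The anomalies noted above (Category='1.9', Rating=19) are not removed by the steps listed here. One possible way to handle them, shown on made-up rows, is to drop ratings outside the 1 to 5 range before imputing. The valid range and the sample data are assumptions for this sketch.

```python
import pandas as pd

# Hypothetical rows; the last one mimics the shifted record with Rating = 19
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Category": ["TOOLS", "GAME", "FAMILY", "1.9"],
    "Rating": [4.2, None, 3.8, 19.0],
})

# Keep rows whose rating is missing or lies in the assumed valid 1-5 range
valid = apps["Rating"].isna() | apps["Rating"].between(1, 5)
apps_valid = apps[valid].copy()

# Impute only after the impossible value is gone, so it cannot affect the median
apps_valid["Rating"] = apps_valid["Rating"].fillna(apps_valid["Rating"].median())
print(apps_valid)
```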

**User Reviews Data (user_reviews_data)**

Manipulation Done: Checked unique values for all columns.
Insight: Found many missing values in Translated_Review, Sentiment, Sentiment_Polarity, Sentiment_Subjectivity.

Manipulation Done: Filled missing Translated_Review with mode (most common review).
Insight: All review texts complete; no missing review content.

Manipulation Done: Filled missing Sentiment with mode.
Insight: Sentiment column complete; can classify reviews as Positive, Neutral, Negative.

Manipulation Done: Filled missing Sentiment_Polarity and Sentiment_Subjectivity with median.
Insight: Numeric sentiment columns complete and ready for analysis; median avoids bias from outliers.

Manipulation Done: Removed duplicates.
Insight: Dataset now has 30,679 rows with no exact duplicates.
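
Filling missing labels with the mode changes the class balance, so it can help to compare the shares before and after. The sketch below does this on a small made-up series; the values are illustrative only.

```python
import pandas as pd

# Hypothetical review labels with missing entries
sentiment = pd.Series(["Positive", "Negative", None, "Positive", None, "Neutral"])

# Share of each label among the observed (non-missing) reviews
observed_share = sentiment.value_counts(normalize=True)

# Share after filling gaps with the most frequent label
filled = sentiment.fillna(sentiment.mode()[0])
filled_share = filled.value_counts(normalize=True)

print("Observed:", observed_share["Positive"])
print("After mode fill:", filled_share["Positive"])
```

If the two shares differ a lot, later charts of sentiment counts partly reflect the imputation rather than what users wrote.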

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

# Univariate Analysis
Histogram: one numeric column

Count plot: one categorical column

Box plot: one numeric column

#### Chart - 1

In [None]:
#Histogram
# Distribution of App Ratings
sns.histplot(playstore_data['Rating'], bins=20, kde=True)
plt.title('Distribution of App Ratings')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

To see how app ratings are distributed.

##### 2. What is/are the insight(s) found from the chart?

Most apps have ratings between 4 and 4.5, few below 3.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Focus on apps with high ratings to maintain user trust.

Low-rated apps may reduce overall store credibility.

#### Chart - 2

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(y='Category', data=playstore_data, order=playstore_data['Category'].value_counts().index)
plt.title('Number of Apps per Category')
plt.xlabel('Count')
plt.ylabel('Category')
plt.show()


##### 1. Why did you pick the specific chart?

To see which app categories have more apps.

##### 2. What is/are the insight(s) found from the chart?

FAMILY, GAME, TOOLS have the most apps; WEATHER, PARENTING have fewer.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Invest in popular categories to reach more users.

Popular categories are crowded, making it harder for new apps to succeed.

#### Chart - 3

In [None]:
sns.boxplot(x='Type', y='Rating', data=playstore_data)
plt.title('Rating Distribution for Free vs Paid Apps')
plt.xlabel('Type')
plt.ylabel('Rating')
plt.show()


##### 1. Why did you pick the specific chart?

To compare the rating distributions of Free and Paid apps.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
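# Reviews is stored as text, so convert it to numbers before plotting on a log axis
reviews_numeric = pd.to_numeric(playstore_data['Reviews'], errors='coerce')
sns.scatterplot(x=reviews_numeric, y=playstore_data['Rating'])
plt.title('Rating vs Number of Reviews')
plt.xlabel('Reviews')
plt.ylabel('Rating')
plt.xscale('log')
plt.show()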


##### 1. Why did you pick the specific chart?

To check relationship between reviews and ratings.

##### 2. What is/are the insight(s) found from the chart?

Apps with more reviews usually have ratings 4–4.5.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Promote apps with many reviews to increase downloads.

Apps with few reviews and low ratings may hurt brand image.

#### Chart - 5

In [None]:
top_genres = playstore_data['Genres'].value_counts().head(10)
top_genres.plot(kind='barh', figsize=(10,6))
plt.title('Top 10 Genres by App Count')
plt.xlabel('Number of Apps')
plt.ylabel('Genre')
plt.show()


##### 1. Why did you pick the specific chart?

To identify most popular genres.

##### 2. What is/are the insight(s) found from the chart?

Most apps are in Tools, Entertainment, Education.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Focus on popular genres to attract users.

High saturation reduces visibility for new apps.

#### Chart - 6

In [None]:
sns.boxplot(x='Content Rating', y='Rating', data=playstore_data)
plt.title('Rating vs Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Rating')
plt.show()


##### 1. Why did you pick the specific chart?

To see how rating varies by age group.

##### 2. What is/are the insight(s) found from the chart?

Apps for Everyone have higher ratings than Mature 17+.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Target family-friendly apps for higher engagement.

Mature apps may have fewer downloads, limiting revenue.

#### Chart - 7

In [None]:
# Step 1: Replace 'Free' with 0
playstore_data['Price_Clean'] = playstore_data['Price'].replace('Free', '0')

# Step 2: Remove $ sign
playstore_data['Price_Clean'] = playstore_data['Price_Clean'].str.replace('$','', regex=False)

# Step 3: Keep only numeric values, replace invalid ones with 0
playstore_data['Price_Clean'] = pd.to_numeric(playstore_data['Price_Clean'], errors='coerce').fillna(0)

# Step 4: Plot
sns.histplot(playstore_data['Price_Clean'], bins=20, kde=True)
plt.title('Distribution of App Prices')
plt.xlabel('Price ($)')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

To see how app prices are distributed.




##### 2. What is/are the insight(s) found from the chart?

Most apps are free.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Free apps attract more users.

High-priced apps may have low adoption.

#### Chart - 8

In [None]:
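# Installs is stored as text such as '10,000+', so strip ',' and '+' and convert to numbers
installs_numeric = pd.to_numeric(
    playstore_data['Installs'].str.replace('[+,]', '', regex=True), errors='coerce')
sns.scatterplot(x=installs_numeric, y=playstore_data['Rating'])
plt.title('Installs vs Rating')
plt.xlabel('Installs')
plt.ylabel('Rating')
plt.xscale('log')
plt.show()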


##### 1. Why did you pick the specific chart?

To see if highly installed apps are highly rated.

##### 2. What is/are the insight(s) found from the chart?

Highly installed apps usually maintain high ratings.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Focus on promoting apps with many installs.
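
To back the scatter plot with numbers, the ratings can be summarised per install bucket. The sketch below uses made-up rows, and the `Installs_Num` column name is an assumption.

```python
import pandas as pd

# Hypothetical rows in the same text format as the Installs column
apps = pd.DataFrame({
    "Installs": ["1,000+", "1,000+", "1,000,000+", "1,000,000+", "10,000+"],
    "Rating": [3.9, 4.1, 4.4, 4.6, 4.0],
})

# Convert '1,000+' style text to integers
apps["Installs_Num"] = pd.to_numeric(
    apps["Installs"].str.replace("[+,]", "", regex=True), errors="coerce")

# Median rating and app count for each install bucket, smallest bucket first
summary = apps.groupby("Installs_Num")["Rating"].agg(["median", "count"]).sort_index()
print(summary)
```

Showing the count next to the median helps avoid over-reading buckets that contain only a handful of apps.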

#### Chart - 9

In [None]:
sns.countplot(x='Type', data=playstore_data)
plt.title('Count of Free vs Paid Apps')
plt.show()


##### 1. Why did you pick the specific chart?

To compare the number of Free vs Paid apps.

##### 2. What is/are the insight(s) found from the chart?

Majority of apps are free.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Free apps reach more users.

Paid apps need extra quality to compete.
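
The count plot shows absolute numbers; the same comparison can be stated as percentages. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical Type values standing in for playstore_data['Type']
types = pd.Series(["Free", "Free", "Paid", "Free", "Free"], name="Type")

# Percentage of apps in each type, rounded to one decimal place
share = types.value_counts(normalize=True).mul(100).round(1)
print(share)
```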

#### Chart - 10

In [None]:
sns.histplot(user_reviews_data['Sentiment_Polarity'], bins=20, kde=True)
plt.title('Distribution of Sentiment Polarity')
plt.xlabel('Polarity')
plt.ylabel('Count')
plt.show()


##### 1. Why did you pick the specific chart?

To see how positive or negative reviews are.

##### 2. What is/are the insight(s) found from the chart?

Most reviews are slightly positive (Polarity > 0).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive reviews help increase user trust.

Negative reviews may reduce downloads if frequent.

#### Chart - 11

In [None]:
sns.countplot(x='Sentiment', data=user_reviews_data)
plt.title('Count of Sentiments')
plt.show()


##### 1. Why did you pick the specific chart?

To see distribution of Positive, Neutral, Negative reviews.

##### 2. What is/are the insight(s) found from the chart?

Most reviews are Positive.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Overall positive sentiment boosts app reputation.

Negative reviews highlight areas needing improvement.
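
To find which apps need improvement, the sentiment counts can be broken down per app. The sketch below uses `pd.crosstab` on made-up reviews; the data is illustrative only.

```python
import pandas as pd

# Hypothetical reviews for two apps
reviews = pd.DataFrame({
    "App": ["A", "A", "A", "B", "B"],
    "Sentiment": ["Positive", "Negative", "Positive", "Negative", "Negative"],
})

# Row-normalised table: each row shows the share of every sentiment for one app
share_by_app = pd.crosstab(reviews["App"], reviews["Sentiment"], normalize="index")

# Apps ordered by their share of negative reviews, highest first
most_negative = share_by_app["Negative"].sort_values(ascending=False)
print(most_negative)
```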

#### Chart - 12

In [None]:
sns.scatterplot(x='Sentiment_Polarity', y='Sentiment_Subjectivity', data=user_reviews_data)
plt.title('Polarity vs Subjectivity of Reviews')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Subjectivity')
plt.show()


##### 1. Why did you pick the specific chart?

To see relation between polarity and subjectivity.

##### 2. What is/are the insight(s) found from the chart?

Positive reviews are more subjective, neutral ones less.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understand user opinions better to improve apps.
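
The claim about subjectivity by sentiment can be checked numerically with a group mean. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical reviews with sentiment label and subjectivity score
reviews = pd.DataFrame({
    "Sentiment": ["Positive", "Positive", "Neutral", "Negative"],
    "Sentiment_Subjectivity": [0.6, 0.8, 0.0, 0.5],
})

# Average subjectivity for each sentiment class
mean_subjectivity = reviews.groupby("Sentiment")["Sentiment_Subjectivity"].mean()
print(mean_subjectivity)
```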

#### Chart - 13

In [None]:
top_apps_reviews = user_reviews_data['App'].value_counts().head(10)
top_apps_reviews.plot(kind='barh', figsize=(10,6))
plt.title('Top 10 Apps by Number of Reviews')
plt.xlabel('Number of Reviews')
plt.ylabel('App')
plt.show()


##### 1. Why did you pick the specific chart?

To find which apps have most reviews.

##### 2. What is/are the insight(s) found from the chart?

Top apps get most user attention and feedback.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Focus updates and improvements on top-reviewed apps.

Apps with very low reviews might be ignored by users.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Select numeric columns only
numeric_data = playstore_data[['Rating', 'Reviews']].copy()

# Convert Reviews to numeric
numeric_data['Reviews'] = pd.to_numeric(numeric_data['Reviews'], errors='coerce')

# Create correlation matrix
corr = numeric_data.corr()

# Plot heatmap
plt.figure(figsize=(6,4))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()


##### 1. Why did you pick the specific chart?

To understand the relationship between numeric variables using correlation values.

##### 2. What is/are the insight(s) found from the chart?

Ratings and Reviews show a weak positive correlation.
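
Review counts span several orders of magnitude, so a log transform is one option to consider before computing the correlation. The sketch below is an assumption about how that could look, on made-up rows; the `Log_Reviews` column is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; '3.0M' mimics the malformed value in the Reviews column
apps = pd.DataFrame({
    "Rating": [3.5, 4.0, 4.2, 4.4, 4.5],
    "Reviews": ["10", "100", "1000", "100000", "3.0M"],
})

# Unparseable text becomes NaN and is skipped by corr()
apps["Reviews"] = pd.to_numeric(apps["Reviews"], errors="coerce")

# log10(x + 1) keeps zero-review apps defined
apps["Log_Reviews"] = np.log10(apps["Reviews"] + 1)

corr = apps[["Rating", "Reviews", "Log_Reviews"]].corr()
print(corr.loc["Rating"])
```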

#### Chart - 15 - Pair Plot

In [None]:
# Prepare numeric data
pair_data = playstore_data[['Rating', 'Reviews']].copy()

# Convert Reviews to numeric
pair_data['Reviews'] = pd.to_numeric(pair_data['Reviews'], errors='coerce')

# Create pair plot
sns.pairplot(pair_data)
plt.show()


##### 1. Why did you pick the specific chart?

To visualize multiple relationships between numeric variables at once.

##### 2. What is/are the insight(s) found from the chart?

Ratings are mostly clustered between 4 and 4.5.

Apps with higher reviews tend to maintain stable ratings.

## **5. Solution to Business Objective**


#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

**1. Focus on High-Rated Apps**
Apps with ratings between 4 and 4.5 perform better and gain more downloads, so improving app quality and user experience should be a top priority.

**2. Invest in Popular Categories**
Categories like GAME, FAMILY, TOOLS, and ENTERTAINMENT attract more users, making them ideal for growth and marketing efforts.

**3. Encourage User Reviews**
Apps with more reviews tend to maintain stable ratings, so prompting users to leave reviews can increase app visibility and trust.

**4. Improve Low-Rated Apps**
Apps with low ratings but high installs need quality improvements to avoid losing users and harming brand reputation.

**5. Adopt a Balanced Pricing Strategy**
Free apps attract more users, while high-quality paid apps generate revenue—use freemium or trial models to balance both.

**6. Use Sentiment Analysis for Feedback**
Negative reviews highlight issues that should be fixed quickly, while positive feedback helps identify successful features.

**7. Optimize App Size and Performance**
Since app size does not strongly affect ratings, focus on performance optimization rather than size reduction alone.
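
Recommendation 4 can be turned into a simple filter. In the sketch below the thresholds `MIN_INSTALLS` and `MAX_RATING`, the `Installs_Num` column, and the sample rows are all assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical apps in the same text format as the Installs column
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Installs": ["1,000,000+", "500+", "5,000,000+", "10,000,000+"],
    "Rating": [3.2, 2.9, 4.5, 3.4],
})
apps["Installs_Num"] = pd.to_numeric(
    apps["Installs"].str.replace("[+,]", "", regex=True), errors="coerce")

MIN_INSTALLS = 1_000_000  # assumed cut-off for "high installs"
MAX_RATING = 3.5          # assumed cut-off for "low rating"

# Widely installed apps whose rating falls below the cut-off
needs_attention = apps[
    (apps["Installs_Num"] >= MIN_INSTALLS) & (apps["Rating"] < MAX_RATING)
]
print(needs_attention[["App", "Installs_Num", "Rating"]])
```

Both cut-offs would need to be tuned to the real data and to what the business counts as a large user base.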

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***