# **Project Name - PLAY STORE APP REVIEW AND ANALYSIS**



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

This project explores the Play Store apps data and user reviews to identify the key factors that drive app engagement and success. The initial data loading and wrangling steps were:

*   **Loading datasets:** Two datasets, Play Store Data.csv and User Reviews.csv, were loaded into pandas DataFrames.
*   **Handling missing values:** Rows with missing values were removed from both datasets using the dropna() method.
*   **Handling duplicates:** Duplicate rows were removed from both datasets using the drop_duplicates() method.
*   **Data cleaning and type conversion:**
    *   The 'Installs' column in the Play Store data was cleaned by removing the ',' and '+' characters and converted to a numeric type.
    *   The 'Price' column was cleaned by removing the '$' sign and converted to a numeric type.
    *   The 'Size' column was processed to convert 'M' and 'k' to their numerical equivalents, replace 'Varies with device' with NaN, and impute the missing values with the median size. The column was then converted to a numeric type.
    *   The 'Reviews' column was converted to a numeric type.
*   **Merging datasets:** The two DataFrames were merged on the 'App' column using an inner merge to create a combined DataFrame (merge_df) containing app details and their corresponding user reviews and sentiment analysis.

The merged dataset allows the relationships between app characteristics (category, rating, size, installs, type, and price) and user sentiment metrics (sentiment, sentiment polarity, and sentiment subjectivity) to be analysed. The next steps are to visualize these relationships and extract actionable insights into what drives app success in the Play Store.



# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**Explore and analyse the data to discover key factors responsible for app engagement and success.**

#### **Define Your Business Objective?**

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps. Explore and analyse the data to discover key factors responsible for app engagement and success.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each and every chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
# Dataset First view
data1 = pd.read_csv('/content/drive/MyDrive/EDA/Play Store Data.csv')
data2 = pd.read_csv('/content/drive/MyDrive/EDA/User Reviews.csv')

# Show the first rows of each dataset
display(data1.head())
display(data2.head())
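
The hard-coded Drive paths above only resolve inside Colab with the same folder layout. Since the guidelines ask for exception handling, here is a minimal sketch of a more defensive loader. `load_csv` is a hypothetical helper that is not part of the original notebook, and the commented calls simply reuse the paths above.

```python
from pathlib import Path

import pandas as pd


def load_csv(path):
    """Read a CSV into a DataFrame, failing with a clear message if it is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    return pd.read_csv(csv_path)


# Intended use with the notebook's own paths (requires the Drive mount above):
# data1 = load_csv('/content/drive/MyDrive/EDA/Play Store Data.csv')
# data2 = load_csv('/content/drive/MyDrive/EDA/User Reviews.csv')
```

Checking for the file first turns a long pandas traceback into a single message that names the missing path.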

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(data1.shape)
print(data2.shape)

### Dataset Information

In [None]:
# Dataset Info
data1.info()
data2.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data1.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
data1.isnull().sum()


In [None]:
# Visualizing the missing values
sns.heatmap(data1.isnull(), cbar=False)
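
The heatmap shows where values are missing but not how much of each column is affected. A small sketch of a tabular summary follows; `missing_summary` is a hypothetical helper and the toy frame only stands in for data1.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return per-column missing counts and percentages, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        'missing': counts,
        'percent': (counts / len(df) * 100).round(2),
    })
    return summary.sort_values('missing', ascending=False)


# Toy frame standing in for data1 (values are illustrative, not real data)
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Rating': [4.1, np.nan, 3.9, np.nan],
    'Type': ['Free', 'Free', None, 'Paid'],
})
print(missing_summary(toy))
```

In the notebook the same call would be `missing_summary(data1)`, run before the rows with missing values are dropped.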


### What did you know about your dataset?

This dataset appears to contain information about various mobile applications, including their names, categories, ratings, reviews, size, installation numbers, type, price, content rating, genres, last update, current version, and Android version. The presence of missing values suggests that some data points may be incomplete or unavailable for certain applications.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(data1.columns)
print(data2.columns)

In [None]:
# Dataset Describe
data1.describe()


In [None]:
# Dataset Describe (user reviews)

data2.describe()

### Variables Description

The statistics below come from data2.describe().

**Sentiment_Polarity**

Description: the sentiment polarity of a review text, ranging from -1 (most negative) through 0 (neutral) to 1 (most positive).

*   Mean: 0.182146, so the texts are slightly positive on average.
*   Standard deviation: 0.351301, a moderate amount of variation in polarity.
*   Minimum: -1.0, so some texts are strongly negative.
*   25th percentile: 0.0, so 25% of the texts have a polarity at or below 0 (neutral or negative).
*   Median: 0.150000, so half of the texts have a polarity at or below 0.15.
*   75th percentile: 0.400000, so 75% of the texts have a polarity at or below 0.4.
*   Maximum: 1.0, so some texts are strongly positive.

**Sentiment_Subjectivity**

Description: the subjectivity of a review text, ranging from 0 (very objective) to 1 (very subjective).

*   Mean: 0.492704, so the texts are moderately subjective on average.
*   Standard deviation: 0.259949, a moderate amount of variation on a 0 to 1 scale.
*   Minimum: 0.0, so some texts are very objective.
*   25th percentile: 0.357143, so 25% of the texts have a subjectivity at or below 0.357143.
*   Median: 0.514286, so half of the texts have a subjectivity at or below 0.514286.
*   75th percentile: 0.650000, so 75% of the texts have a subjectivity at or below 0.65.
*   Maximum: 1.0, so some texts are very subjective.

Overall, this dataset contains sentiment analysis metrics for a large number of review texts: Sentiment_Polarity captures how positive or negative a text is, and Sentiment_Subjectivity captures how subjective it is.
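
Percentiles from describe() are cut points, not class shares: a median of 0.15 says that half of the values lie at or below 0.15, not that half of the texts are positive. The sketch below illustrates the difference on a toy list of polarity scores (illustrative values, not taken from the dataset) using only the standard library.

```python
import statistics

# Toy polarity scores (illustrative values, not taken from the dataset)
polarity = [-0.5, 0.0, 0.0, 0.1, 0.2, 0.4, 0.6, 1.0]

# Quartile cut points: q1 = 25th percentile, q2 = median, q3 = 75th percentile
q1, q2, q3 = statistics.quantiles(polarity, n=4, method='inclusive')

# Share of texts that are strictly positive, computed directly
share_positive = sum(p > 0 for p in polarity) / len(polarity)

print(f"Q1={q1:.3f}, median={q2:.3f}, Q3={q3:.3f}")
print(f"Share with polarity > 0: {share_positive:.1%}")
```

On the real data, the equivalent direct check would be `(data2['Sentiment_Polarity'] > 0).mean()`.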

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
# print() both results; a cell only auto-displays its last expression
print(data1.nunique())
print(data2.nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# Handle missing values
data1.dropna(inplace=True)
data2.dropna(inplace=True)

# Handle duplicates
data1.drop_duplicates(inplace=True)
data2.drop_duplicates(inplace=True)

# Clean 'Installs' column in data1: strip ',' and '+' as literal characters
data1['Installs'] = data1['Installs'].str.replace(',', '', regex=False).str.replace('+', '', regex=False)
data1['Installs'] = pd.to_numeric(data1['Installs'])

# Clean 'Price' column in data1: strip the literal '$' sign
data1['Price'] = data1['Price'].str.replace('$', '', regex=False)
data1['Price'] = pd.to_numeric(data1['Price'])

# Clean 'Size' column in data1
# Rewrite the 'M' and 'k' suffixes as exponents so that e.g. '19M' becomes '19e6' (bytes)
data1['Size'] = data1['Size'].apply(lambda x: str(x).replace('M', 'e6').replace('k', 'e3').replace(',', ''))
# Convert 'Varies with device' to NaN, then fill with the median size
data1['Size'] = data1['Size'].replace('Varies with device', np.nan)
data1['Size'] = pd.to_numeric(data1['Size'])
# Assign the result back instead of using inplace=True on a column selection
data1['Size'] = data1['Size'].fillna(data1['Size'].median())


# Convert 'Reviews' to numeric
data1['Reviews'] = pd.to_numeric(data1['Reviews'])

# Merge the two dataframes
merge_df = pd.merge(data1, data2, on='App', how='inner') # Using inner merge to keep only matching apps

# Display information about the merged dataframe
print("Merged Dataframe Info:")
merge_df.info()

# Display descriptive statistics for the merged dataframe
print("\nMerged Dataframe Description:")
display(merge_df.describe())

# Display the first few rows of the merged dataframe
print("\nMerged Dataframe Head:")
display(merge_df.head())

# Check for missing and duplicate values in the merged dataframe
print("\nMissing values in Merged Dataframe:")
print(merge_df.isnull().sum())

print("\nDuplicate values in Merged Dataframe:")
print(merge_df.duplicated().sum())
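
The 'Size' cleaning in the cell above relies on chained string replacements. As an alternative sketch, the same conversion can be written as one explicit function that is easier to test in isolation. `parse_size` is a hypothetical helper, not part of the original notebook; it assumes the same decimal multipliers (1e6 for 'M', 1e3 for 'k') and returns NaN for anything it cannot parse.

```python
import math


def parse_size(value):
    """Convert a Play Store size string such as '19M' or '201k' to bytes.

    Returns NaN for 'Varies with device' or anything else that cannot be parsed.
    Hypothetical helper; it uses the same decimal multipliers (1e6, 1e3)
    as the notebook's string replacement.
    """
    multipliers = {'M': 1e6, 'k': 1e3}
    text = str(value).strip().replace(',', '')
    if text and text[-1] in multipliers:
        number, factor = text[:-1], multipliers[text[-1]]
    else:
        number, factor = text, 1.0
    try:
        return float(number) * factor
    except ValueError:
        return math.nan


print(parse_size('19M'), parse_size('201k'), parse_size('Varies with device'))
```

Applied to the column, this would read `data1['Size'] = data1['Size'].map(parse_size)`, followed by the same median imputation.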

### What all manipulations have you done and insights you found?

The wrangling code above performs the following manipulations:

*   Rows with missing values are dropped from both datasets with dropna().
*   Duplicate rows are dropped from both datasets with drop_duplicates().
*   In the Play Store data, 'Installs', 'Price', 'Size', and 'Reviews' are cleaned and converted to numeric types, and 'Varies with device' sizes are replaced with the median size.
*   The two datasets are merged on 'App' with an inner merge into merge_df.

Dataset 1 (Play Store data):

*   This dataset has 8,886 rows and 13 columns.
*   The columns are 'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', and 'Android Ver'.
*   It describes each app: name, category, rating, reviews, size, installation numbers, type, price, content rating, genres, last update, current version, and Android version.
*   As loaded, every column except 'Rating' (float64) has the 'object' data type, which is why 'Reviews', 'Size', 'Installs', and 'Price' have to be converted before numerical analysis.
*   The raw data contains missing values (non-null counts below the total row count), so some apps have incomplete records; these rows are removed by dropna().

Dataset 2 (user reviews):

*   This dataset has 29,692 rows and 5 columns.
*   The columns are 'App', 'Translated_Review', 'Sentiment', 'Sentiment_Polarity', and 'Sentiment_Subjectivity'.
*   It contains sentiment analysis data: the app name, translated review text, sentiment label, sentiment polarity, and sentiment subjectivity.
*   The data types are 'object' for 'App', 'Translated_Review', and 'Sentiment', and 'float64' for 'Sentiment_Polarity' and 'Sentiment_Subjectivity', which is suitable for numerical analysis.
*   All the columns have complete data (no missing values).

Next steps for the analysis:

*   Perform exploratory data analysis to understand the distributions, relationships, and patterns in the Play Store data.
*   Explore the sentiment data, such as the distribution of sentiment labels, polarity, and subjectivity.
*   Investigate relationships between the sentiment metrics and the app information from the first dataset (e.g., app category, rating, reviews).
*   Consider additional sentiment analysis or text processing on the 'Translated_Review' column.

Combining insights from both datasets gives a more comprehensive understanding of the mobile app market and the sentiment associated with the apps.
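
One thing worth checking after an inner merge on 'App' is whether app names are unique in the Play Store data. If the same name appears on more than one row, every review of that app is matched to each of those rows and the merged frame grows. The sketch below shows the effect on toy frames and one possible guard; keeping the first row per app is an arbitrary assumption made for illustration, not the notebook's method.

```python
import pandas as pd

# Toy frames standing in for data1 and data2 (illustrative values only)
apps = pd.DataFrame({
    'App': ['A', 'A', 'B'],          # 'A' is listed twice
    'Reviews': [100, 120, 50],
})
reviews = pd.DataFrame({
    'App': ['A', 'A', 'B', 'C'],
    'Sentiment': ['Positive', 'Negative', 'Positive', 'Neutral'],
})

# Plain inner merge: each review of 'A' is matched to both 'A' rows
merged = pd.merge(apps, reviews, on='App', how='inner')
print(len(reviews), '->', len(merged))

# Keeping one row per app first avoids the duplication; validate= then
# raises pandas.errors.MergeError if the left keys are still not unique.
apps_unique = apps.drop_duplicates(subset='App', keep='first')
merged_unique = pd.merge(apps_unique, reviews, on='App', how='inner',
                         validate='one_to_many')
print(len(merged_unique))
```

Comparing `len(merge_df)` with the number of review rows whose 'App' also appears in data1 is a quick way to see whether this applies to the real data.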

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
#histogram of app ratings
plt.figure(figsize=(8, 6))
plt.hist(merge_df['Rating'], bins=20, color='olive', edgecolor='black') # Using a single color for histogram
plt.title('Distribution of App Ratings')
plt.xlabel('Rating')
plt.ylabel('count')
plt.show()

##### 1. Why did you pick the specific chart?

In the context of the app details dataset, a histogram is a natural choice to explore the distribution of the app ratings. It allows you to quickly get a sense of how the ratings are spread out and identify any interesting patterns or anomalies in the data. This information can then be used to inform further analysis, such as investigating factors that contribute to higher or lower app ratings.

##### 2. What is/are the insight(s) found from the chart?

The histogram shows the distribution, central tendency, and variability of the app ratings and exposes potential anomalies in the dataset. These insights can guide further analysis, such as exploring the factors that contribute to higher or lower ratings, identifying target segments for marketing or feature development, or investigating the reasons behind unusual rating patterns. In turn, they can inform strategic decision-making and product development for the apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the histogram can inform the business's strategic decisions, such as which apps to focus on, which features to prioritize, or which market segments to target, ultimately leading to better resource allocation and higher profitability.

The histogram provides a high-level view of the rating distribution, but it does not capture the nuanced feedback that can be gleaned from customer reviews and comments. Overemphasizing the histogram at the expense of qualitative customer feedback could lead to a myopic understanding of user needs.

In summary, the insights gained from the histogram of app ratings can have a direct and positive impact on the business's performance, enabling data-driven decisions, an optimized app portfolio, and a better fit to customer needs, which drive growth in the competitive app market.
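
One caveat when reading the Chart 1 histogram: merge_df holds one row per matched review, so an app with many reviews contributes its rating many times. The sketch below contrasts review-level and app-level averages on a toy frame (illustrative values only); the variable names are assumptions introduced for the example.

```python
import pandas as pd

# Toy stand-in for merge_df: one row per review (illustrative values only)
toy_merge = pd.DataFrame({
    'App': ['A', 'A', 'A', 'B'],
    'Rating': [4.5, 4.5, 4.5, 3.0],
})

# Review-level mean: app 'A' counts three times
review_level_mean = toy_merge['Rating'].mean()

# App-level mean: one rating per app
app_ratings = toy_merge.drop_duplicates(subset='App')['Rating']
app_level_mean = app_ratings.mean()

print(review_level_mean, app_level_mean)
```

Passing the app-level series to `plt.hist(app_ratings, bins=20)` would show the distribution of ratings across apps rather than across reviews.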

#### Chart - 2

In [None]:
# Chart - 2 visualization code
plt.figure(figsize=(12, 8)) # Increased figure size
sns.countplot(y='Category', data=merge_df, order = merge_df['Category'].value_counts().index, palette='viridis') # Added color palette
plt.title('Distribution of App Categories')
plt.xlabel('Count')
plt.ylabel('Category')
plt.show() # Added plt.show() to display the plot

##### 1. Why did you pick the specific chart?

A count plot (a bar chart of frequencies) is a suitable choice for a single categorical variable such as 'Category'. It shows how many rows of the merged data fall into each app category, and ordering the bars by count makes the most and least represented categories easy to identify.



##### 2. What is/are the insight(s) found from the chart?

The chart shows how many rows of merge_df belong to each category, sorted from the most to the least frequent. Because merge_df holds one row per matched user review, the length of each bar reflects the volume of reviews in that category rather than the number of distinct apps. Categories with long bars dominate the merged data and carry the most weight in any aggregate computed from it, while categories with short bars are represented by few reviews and support only tentative conclusions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the category counts can have a positive business impact. Knowing which categories attract the most reviews helps a business judge where user engagement is concentrated and where competition for attention is likely to be strongest. There is also a risk of negative growth if the counts are misread: a long bar measures review volume, not satisfaction, so investing heavily in a category on the strength of its count alone, without checking its ratings and sentiment, could direct resources to a crowded category with dissatisfied users.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Reviews', y='Rating', data=merge_df, color='teal') # Using a single color for scatter plot
plt.xscale('log') # Apply logarithmic scale to the x-axis
plt.title('Scatter Plot of Reviews vs. Rating (Log Scale for Reviews)')
plt.xlabel('Reviews (Log Scale)')
plt.ylabel('Rating')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is a good choice to visualize the relationship between two numerical variables, in this case, 'Reviews' and 'Rating'. It allows us to see if there is a correlation or pattern between the number of reviews an app has and its rating. Each point on the scatter plot represents an app, with its position determined by its number of reviews on the x-axis and its rating on the y-axis.

##### 2. What is/are the insight(s) found from the chart?

Looking at the scatter plot of Reviews vs. Rating, we can observe the relationship between the number of reviews an app has and its rating. Generally, we might expect to see a trend where apps with more reviews tend to have higher ratings, as more popular apps often receive more feedback, and successful apps are likely to have positive reviews. However, the scatter plot will show if this is consistently true or if there are exceptions. We can look for clusters of points, trends (positive or negative correlation), and outliers (apps with a high number of reviews but a low rating, or vice versa).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In summary, the scatter plot can be a powerful tool for gaining insights into the relationship between reviews and ratings, which can inform business decisions and lead to positive growth. However, it's crucial to interpret the insights carefully, consider other factors, and complement the quantitative analysis with qualitative data to avoid potentially negative outcomes.
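
To put a number on the pattern in the scatter plot, a rank correlation is a reasonable companion because 'Reviews' is heavily right-skewed. A minimal sketch on toy values follows (illustrative data only); it computes Spearman's coefficient as the Pearson correlation of the ranks, which avoids any dependency beyond pandas.

```python
import pandas as pd

# Toy stand-in for the app-level columns (illustrative values only)
toy = pd.DataFrame({
    'Reviews': [10, 1_000, 50_000, 2_000_000, 300],
    'Rating': [3.8, 4.1, 4.4, 4.5, 4.0],
})

# Spearman correlation computed as the Pearson correlation of the ranks;
# ranks are insensitive to the heavy right skew of 'Reviews'.
spearman = toy['Reviews'].rank().corr(toy['Rating'].rank())
print(round(spearman, 3))
```

The toy values are deliberately monotonic, so the coefficient here says nothing about the real dataset; the same two lines can be run on the notebook's columns to get the actual figure.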

#### Chart - 4

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(x='Type', data=merge_df, palette='plasma') # Added color palette
plt.title('Count Plot of App Types')
plt.xlabel('Type')
plt.ylabel('Count')
plt.xticks(rotation=0) # Removed rotation for better readability with only two categories
plt.tight_layout() # Adjust layout to prevent labels from being cut off
plt.show() # Added plt.show() to display the plot

##### 1. Why did you pick the specific chart?

A count plot is used to display the counts of observations in each categorical bin. In this case, it's used to show the distribution of app types ('Free' vs. 'Paid') and how many apps fall into each category. This helps visualize the proportion of free and paid apps in the dataset.

##### 2. What is/are the insight(s) found from the chart?

Looking at the count plot of app types, we can see the number of apps that are 'Free' and the number of apps that are 'Paid'. The most obvious insight is which type of app is more prevalent in the dataset. We can also see the relative proportion of free versus paid apps. For example, if the 'Free' bar is significantly taller than the 'Paid' bar, it indicates that there are many more free apps than paid apps in this dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights from the count plot of app types can be valuable for strategic decision-making, particularly regarding pricing and monetization. However, it's essential to consider these insights in conjunction with other factors, such as user reviews and the success of individual apps, to avoid potentially negative outcomes from a limited perspective.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(x='Content Rating', y='Rating', data=merge_df, palette='tab10') # Added color palette
plt.title('Box Plot of App Ratings by Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Rating')
plt.show() # Added plt.show() to display the plot

##### 1. Why did you pick the specific chart?

A box plot is a suitable choice to visualize the relationship between a categorical variable ('Content Rating') and a numerical variable ('Rating'). It allows us to compare the distribution of ratings across different content ratings and identify any differences in the median rating, spread, or presence of outliers among them.

##### 2. What is/are the insight(s) found from the chart?

Looking at the box plot of App Ratings by Content Rating, we can compare the distribution of ratings for different content ratings (e.g., Everyone, Teen, Mature 17+, etc.). The box plot shows the median rating for each content rating category, as well as the spread of ratings within each category (the interquartile range) and any outliers. This allows us to see if certain content ratings tend to have higher or lower median ratings, or if the variability in ratings differs significantly across content ratings. For instance, we might observe if apps rated 'Everyone' tend to have a different rating distribution compared to apps rated 'Mature 17+'.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

In summary, the insights from the box plot of App Ratings by Content Rating can be valuable for understanding target audiences, informing content and marketing strategies, and identifying potential risks. However, it's crucial to interpret these insights in a broader context and consider other factors to avoid potentially negative outcomes.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(x='Installs', data=merge_df, palette='viridis') # Added color palette
plt.title('Count Plot of App Installs')
plt.xlabel('Installs')
plt.ylabel('Count')
plt.xticks(rotation=90) # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from being cut off
plt.show()

##### 1. Why did you pick the specific chart?

A count plot for 'Installs' is used to visualize the distribution of install counts. After cleaning, 'Installs' is numeric but takes only a small set of tier values (the lower bounds reported by the Play Store, such as 1,000 or 1,000,000), so it behaves like an ordered categorical variable. A count plot therefore shows how many rows fall into each install tier.

##### 2. What is/are the insight(s) found from the chart?

This chart shows the frequency of apps within different installation ranges. You can see which installation tiers are most common in the dataset. For example, you might observe that a large number of apps fall into the lower installation ranges (e.g., 100+, 1,000+), while fewer apps have very high installation counts (e.g., 1,000,000+, 10,000,000+).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Identifying the most common installation tiers helps set realistic growth targets and understand market penetration. Strategies can be tailored to move from lower to higher install brackets.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(x='Genres', y='Rating', data=merge_df, palette='husl') # Added color palette
plt.title('Box Plot of App Ratings by Genres')
plt.xlabel('Genres')
plt.ylabel('Rating')
plt.xticks(rotation=90) # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from being cut off
plt.show()

##### 1. Why did you pick the specific chart?

A box plot is a suitable choice to visualize the relationship between 'Genres' (a categorical variable) and 'Rating' (a numerical variable). Similar to the box plot for categories, it allows us to compare the distribution of ratings across different app genres and identify any differences in the median rating, spread, or presence of outliers among them.

##### 2. What is/are the insight(s) found from the chart?

This chart will show the distribution of app ratings within each genre. You can identify genres that tend to have higher or lower median ratings, as well as those with a wider or narrower spread of ratings. Outliers in specific genres might indicate exceptionally good or bad apps within that genre. This helps in understanding how app ratings vary across different types of content and functionality.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Identifying genres with consistently high ratings can inform a business about potentially lucrative areas for app development or investment. Understanding the rating distribution within a genre helps in setting realistic performance expectations.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(x='Price', data=merge_df, palette='Spectral') # Added color palette
plt.title('Count Plot of App Prices')
plt.xlabel('Price')
plt.ylabel('Count')
plt.xticks(rotation=90) # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from being cut off
plt.show() # Added plt.show() to display the plot

##### 1. Why did you pick the specific chart?

The count plot for 'Price' shows the distribution of pricing strategies among apps. You can see how many apps are free, how many are in different price tiers, and identify the most common pricing models. This can provide insights into the market's pricing trends.

##### 2. What is/are the insight(s) found from the chart?

This chart illustrates the distribution of app pricing models. You can see the proportion of free apps compared to paid apps, and the frequency of apps at different price points. It is highly likely that the bar at price 0 (free apps) will dominate this plot, indicating that the vast majority of apps in the dataset are free.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Understanding the price distribution helps in positioning a new app competitively. It can highlight potential niches for paid apps if most competitors are free.
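
Because a count plot draws one bar per distinct price, the paid segment can be hard to read next to the free apps. One option is to report the free share separately and group paid prices into bands. The sketch below does this on toy prices; the bin edges and labels are arbitrary choices made for illustration.

```python
import pandas as pd

# Toy prices in dollars after cleaning (illustrative values only)
prices = pd.Series([0.0, 0.0, 0.0, 0.99, 2.99, 4.99, 14.99])

# Share of free apps
free_share = (prices == 0).mean()

# Bin only the paid apps; bin edges are an arbitrary choice for illustration
paid = prices[prices > 0]
bins = [0, 1, 5, 10, float('inf')]
labels = ['<= $1', '$1-5', '$5-10', '> $10']
price_bands = pd.cut(paid, bins=bins, labels=labels)

print(f"Free share: {free_share:.1%}")
print(price_bands.value_counts().sort_index())
```

With the default `right=True`, each band includes its upper edge, so a price of exactly 5 would fall into '$1-5'.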

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(x='Android Ver', y='Rating', data=data1)
plt.title('Box Plot of App Ratings by Android Versions')
plt.xlabel('Android Ver')
plt.ylabel('Rating')
plt.xticks(rotation=90) # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from being cut off
plt.show()

##### 1. Why did you pick the specific chart?

 A box plot is suitable for comparing the distribution of a numerical variable (Rating) across different categorical groups, in this case, the various Android versions. It allows for a clear visual comparison of the median rating, the spread of ratings, and the presence of outliers for each Android version.

##### 2. What is/are the insight(s) found from the chart?

This chart will reveal if app ratings vary significantly depending on the Android version. You can see if certain Android versions are associated with higher or lower median ratings, or if the variability in ratings differs across versions. This can highlight potential compatibility issues or performance differences on certain Android versions that affect user satisfaction.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

 Identifying Android versions where an app performs well (higher ratings) can help in targeting marketing efforts towards users of those versions. Understanding which versions have lower ratings can inform development priorities to address compatibility or performance issues, leading to improved user satisfaction and ratings on those versions.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(10, 6))
sns.countplot(x='Sentiment', data=merge_df, palette='coolwarm') # Added color palette
plt.title('Count Plot of Sentiment Labels')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show() # Added plt.show() to display the plot

##### 1. Why did you pick the specific chart?

A count plot is the ideal choice for visualizing the frequency of each category in a categorical variable like 'Sentiment'. It clearly shows how many reviews fall into each sentiment class (Positive, Negative, Neutral), providing a quick overview of the overall sentiment towards the apps.

##### 2. What is/are the insight(s) found from the chart?

 This chart will show the proportion of positive, negative, and neutral reviews. You can see which sentiment is most prevalent and the relative balance between positive and negative feedback. A large number of positive reviews is a good sign, while a significant number of negative reviews warrants further investigation.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the overall sentiment distribution helps gauge user satisfaction. A high proportion of positive reviews indicates a generally well-received app or set of apps, which can be leveraged in marketing. Identifying a large number of negative reviews highlights areas that urgently need attention and improvement.
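
The overall counts can be broken down further, for example into the share of each sentiment label within each app category. A minimal sketch with a row-normalised crosstab follows, using a toy frame in place of merge_df (illustrative values only).

```python
import pandas as pd

# Toy stand-in for merge_df (illustrative values only)
toy_merge = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'GAME', 'GAME', 'TOOLS', 'TOOLS'],
    'Sentiment': ['Positive', 'Positive', 'Negative', 'Neutral',
                  'Positive', 'Negative'],
})

# Row-normalised crosstab: each row sums to 1, so values are within-category shares
sentiment_share = pd.crosstab(toy_merge['Category'], toy_merge['Sentiment'],
                              normalize='index')
print(sentiment_share.round(2))
```

Shares are easier to compare across categories than raw counts, since categories differ widely in how many reviews they contain.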

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Sentiment_Polarity', y='Sentiment_Subjectivity', data=merge_df, hue='Sentiment', palette='viridis') # Added color based on Sentiment
plt.title('Scatter Plot of Sentiment Polarity vs. Sentiment Subjectivity')
plt.xlabel('Sentiment Polarity')
plt.ylabel('Sentiment Subjectivity')
plt.show()

##### 1. Why did you pick the specific chart?

A scatter plot is suited to two continuous numerical variables, here 'Sentiment_Polarity' and 'Sentiment_Subjectivity'. Colouring the points by the categorical 'Sentiment' label shows how the positive, negative, and neutral classes map onto the polarity axis and whether the subjectivity of reviews differs between them.

##### 2. What is/are the insight(s) found from the chart?

 This chart would show the distribution of how subjective or objective reviews are for different sentiment levels. For example, you might see if highly positive or highly negative reviews tend to be more subjective than neutral reviews. This can provide insights into the nature of emotional vs. objective feedback.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the relationship between sentiment polarity and subjectivity can help in interpreting reviews. Highly subjective positive reviews might indicate strong emotional attachment, while objective negative reviews might point to specific functional issues. This can inform how to prioritize and act on feedback.
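One way to check the polarity and subjectivity relationship numerically is sketched below. It assumes the three sentiment columns used in the scatter plot; the `toy` frame holds invented values, and `Polarity_Strength` is a hypothetical helper column, not part of the original data.

```python
import pandas as pd

# Toy stand-in for merge_df; values are invented for illustration only.
toy = pd.DataFrame({
    "Sentiment": ["Positive", "Positive", "Negative", "Negative", "Neutral", "Neutral"],
    "Sentiment_Polarity": [0.8, 0.4, -0.6, -0.2, 0.0, 0.0],
    "Sentiment_Subjectivity": [0.9, 0.5, 0.7, 0.3, 0.0, 0.1],
})

# Average subjectivity within each sentiment class.
subjectivity_by_sentiment = toy.groupby("Sentiment")["Sentiment_Subjectivity"].mean()

# Strength of opinion regardless of direction (hypothetical helper column).
toy["Polarity_Strength"] = toy["Sentiment_Polarity"].abs()

# Pearson correlation between opinion strength and subjectivity.
strength_corr = toy["Polarity_Strength"].corr(toy["Sentiment_Subjectivity"])

print(subjectivity_by_sentiment)
print(round(strength_corr, 3))
```

Using the absolute polarity avoids positive and negative reviews cancelling each other out when the question is whether stronger opinions are more subjective.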

#### Chart - 12 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10, 6))
numeric_merge_df = merge_df.select_dtypes(include=np.number) # Select only numeric columns
sns.heatmap(numeric_merge_df.corr(), annot=True, cmap='coolwarm') # Added color palette
plt.title('Correlation Heatmap of Merged Data')
plt.show()

##### 1. Why did you pick the specific chart?

A heatmap is an excellent choice for visualizing the correlation matrix of numerical variables. The color intensity and annotations make it easy to see the strength and direction of the linear relationship between each pair of variables. This is particularly useful in exploratory data analysis to identify potential relationships that warrant further investigation.

##### 2. What is/are the insight(s) found from the chart?

The heatmap shows the correlation coefficients between the numerical columns of the merge_df DataFrame (such as Rating, Reviews, Size, Installs, Price, Sentiment_Polarity, and Sentiment_Subjectivity, after conversion to numeric types where applicable). It lets you identify strong positive correlations (variables that increase together), strong negative correlations (one variable increases as the other decreases), and weak or absent correlations. For example, you might see a positive correlation between 'Reviews' and 'Installs', suggesting that apps with more reviews also tend to have more installs.
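Reading the strongest relationships off a heatmap by eye can be error-prone, so a ranked list of pairs may help. The sketch below is one possible approach, not the notebook's own method: `top_correlations` is a hypothetical helper, and the `toy` frame contains invented numbers. Because merge_df repeats each app's attributes once per review, the example deduplicates on 'App' first so that every app counts once.

```python
import numpy as np
import pandas as pd

def top_correlations(df, method="pearson"):
    """Return each pair of numeric columns once, sorted by absolute correlation."""
    corr = df.select_dtypes(include=np.number).corr(method=method)
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order)

# Toy stand-in for merge_df; values are invented for illustration only.
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d", "e"],
    "Rating": [4.1, 3.8, 4.5, 4.0, 4.3],
    "Reviews": [100, 2000, 50000, 300, 9000],
    "Installs": [1000, 50000, 1000000, 5000, 100000],
})

# One row per app, so heavily reviewed apps do not dominate the coefficients.
ranked = top_correlations(toy.drop_duplicates(subset="App"))
print(ranked)
```

Passing `method="spearman"` is worth trying for heavily skewed columns such as installs and reviews, where a few very large apps can dominate the Pearson coefficient.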

#### Chart - 13 - Pair Plot

In [None]:
# Pair Plot visualization code
sns.pairplot(merge_df, height=2.5, hue='Sentiment', palette='viridis') # Increased height for better visibility and added color based on Sentiment
plt.show() # Added plt.show() to display the plot

##### 1. Why did you pick the specific chart?

 A pair plot is chosen to visualize the relationships between all pairs of numerical variables in the dataset in a single view. It helps in quickly identifying trends, correlations, and distributions that might not be apparent from individual plots or correlation heatmaps alone. It provides a comprehensive overview of the pairwise relationships.

##### 2. What is/are the insight(s) found from the chart?

The scatter plots in the off-diagonal cells show the relationship between two different numerical variables. You can observe the shape of the relationship (linear or non-linear), its direction (positive or negative correlation), and the spread of the data. The diagonal plots show the distribution of each individual variable; because `hue` is set, seaborn draws these as density curves per sentiment class rather than histograms. Together, the cells allow a detailed examination of how each numerical variable is distributed and how it relates to every other numerical variable in the dataset.
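A pair plot on the full merged data can be slow to render, since every review row is drawn in every cell. A possible workaround, sketched here under the assumption that a random subsample is representative enough for exploration, is to trim the frame first. `pairplot_sample` is a hypothetical helper and the `toy` frame is invented.

```python
import numpy as np
import pandas as pd

def pairplot_sample(df, hue="Sentiment", max_rows=2000, seed=42):
    """Keep numeric columns plus the hue column and cap the number of rows."""
    numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
    subset = df[numeric_cols + [hue]].dropna()
    if len(subset) > max_rows:
        # Fixed seed so the same sample is drawn on every run.
        subset = subset.sample(n=max_rows, random_state=seed)
    return subset

# Toy stand-in for merge_df; values are invented for illustration only.
toy = pd.DataFrame({
    "App": [f"app_{i}" for i in range(10)],
    "Rating": np.linspace(3.0, 4.8, 10),
    "Sentiment_Polarity": np.linspace(-1.0, 1.0, 10),
    "Sentiment": ["Positive", "Negative"] * 5,
})

sample = pairplot_sample(toy, max_rows=4)
print(sample)
```

The returned frame can then be passed to `sns.pairplot(..., hue='Sentiment')` in place of merge_df.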

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Based on the analysis we've done so far, here are some suggestions for the client to achieve their business objective of discovering key factors responsible for app engagement and success:

- Focus on High-Rated Categories and Genres: The analysis of ratings by category and genre can reveal which types of apps are generally more well-received by users. The client should consider developing or focusing on apps within these categories and genres.
- Prioritize Apps with High Sentiment Polarity: The user reviews data provides insights into user sentiment. Apps with a higher positive sentiment polarity are likely to have more engaged users. The client should analyze the characteristics of these apps and the factors contributing to positive reviews.
- Investigate the Impact of App Size and Updates: While the correlation heatmap didn't show strong correlations with rating, further investigation into how app size and update frequency relate to user engagement and ratings could be beneficial.
- Understand User Sentiment Drivers: Delve deeper into the user reviews, especially those with high positive sentiment. Identify common themes, keywords, and features that users praise. This can inform product development and marketing strategies.
- Analyze Competitors in Successful Categories: Examine successful apps in high-rated categories and genres to understand their strategies, features, and user engagement tactics.
- Consider the Impact of Price and Type: Analyze how the 'Type' (free vs. paid) and 'Price' of an app influence its installs and ratings. This can help in making informed decisions about pricing models.
- Refine Content Rating and Android Version Targeting: While not strongly correlated with rating in the initial heatmap, understanding the distribution of installs and ratings across different content ratings and Android versions can help in targeting the right audience and ensuring compatibility.

To further refine these suggestions, we could perform more in-depth analysis, such as:

- Natural Language Processing (NLP) on user reviews to extract specific feedback and identify popular features or pain points.
- Segmentation of users based on their reviews and sentiment.
- Time series analysis of app updates and their impact on ratings and reviews.
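As a first step toward the NLP follow-up suggested above, simple keyword counting can already surface recurring pain points. This is a minimal sketch using only the standard library; `top_keywords`, the stopword list, and the sample reviews are all invented for illustration, and a real analysis would read the review text column from the User Reviews data.

```python
import re
from collections import Counter

# Small hand-picked stopword list; an assumption for this sketch only.
STOPWORDS = {"the", "a", "and", "is", "it", "i", "to", "this", "of",
             "after", "too", "many", "every"}

def top_keywords(reviews, n=3):
    """Count the most frequent non-stopword tokens across a list of review texts."""
    counts = Counter()
    for text in reviews:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(n)

# Invented negative reviews, for illustration only.
negative_reviews = [
    "The app crashes after every update",
    "Too many ads and it crashes",
    "Ads, ads, ads. Crashes too.",
]

print(top_keywords(negative_reviews, n=2))
```

Running the same function separately on positive and negative reviews gives a rough view of what users praise versus what they complain about, before investing in heavier NLP tooling.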

# **Conclusion**

In conclusion, achieving app engagement and success in the Play Store is a multifaceted challenge. While factors like category, genre, and the volume of reviews play a role, a deeper understanding of user sentiment and the specific reasons behind positive and negative feedback, as well as the impact of factors like pricing and updates, is crucial. Future analysis should focus on leveraging the rich qualitative data in user reviews and exploring multivariate relationships to gain a more comprehensive picture of what drives success in this competitive market.
