<a href="https://colab.research.google.com/github/ganeshsingh9325/Play-Store-App-Review-Analysis---EDA/blob/main/Final_of_EDA_Play_Store_App_Review_Analysis_ipyn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -  **Play Store App Review Analysis**    



##### **Project Type**    - EDA
##### **Contribution**    - Individual

# **Project Summary -**

The objective of this project is to assist a mobile app developer in achieving their business goals and enhancing their performance in the competitive mobile app market. By analyzing app data, user ratings, and sentiment analysis, we aim to provide valuable insights and recommendations that can guide the client in optimizing their app development strategies, identifying profitable app categories, and improving user satisfaction.

The project begins by examining the relationship between app size and user ratings. Through visualizations such as combination histograms and KDE plots, we analyze the distribution of user ratings across different app sizes. This analysis helps us understand whether there is a correlation between app size and user satisfaction. The insights gained from this exploration can inform the client's decisions regarding app size optimization, ensuring that their apps align with user expectations and preferences.

Next, we examine the impact of pricing on app success. By examining app categories and their corresponding prices for paid apps, we can identify trends and patterns that can guide pricing strategies. The use of strip plots allows us to visualize the pricing distribution within each category, helping the client identify categories that command higher prices and are potentially more profitable. This information enables the client to make informed pricing decisions and maximize their revenue generation.

Additionally, we perform sentiment analysis on user reviews to gain deeper insights into user feedback and sentiments. By merging app data with a review dataset, we analyze the sentiment distribution across different app categories. This analysis helps us identify categories that receive positive sentiment and are associated with high user satisfaction. Furthermore, it highlights categories that may face challenges due to negative sentiment, indicating areas where improvements are needed. These insights allow the client to focus on enhancing user satisfaction, addressing negative feedback, and optimizing their app offerings to meet user expectations.

Overall, this project aims to provide the Mobile App Developer with actionable insights and recommendations to achieve their business objectives in the mobile app market. By analyzing app data, user ratings, and sentiments, we assist the client in optimizing their app development practices, identifying profitable app categories, and improving user satisfaction. The outcomes of this project can lead to enhanced app performance, increased user engagement, and ultimately, positive business impact for the client in the competitive mobile app industry.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


A mobile app developer is facing challenges in achieving their business objectives and wants to improve their performance in the highly competitive app market. They are seeking assistance in identifying the factors that contribute to app success, understanding the relationship between app size and user ratings, and optimizing their pricing strategies to enhance user satisfaction.

The client also wants to leverage user sentiment analysis to gain insights into user feedback and sentiment patterns. The specific problem is to find effective strategies to optimize app size, identify profitable app categories, address negative sentiment, and make informed decisions regarding app pricing.

The objective is to provide the client with actionable recommendations to overcome these challenges, enhance their app development practices, and drive positive business outcomes in the dynamic app market.

#### **Define Your Business Objective?**

**1. Identifying profitable app categories based on pricing and ratings.**

**2. Optimizing app development strategies by considering app size and ratings.**

**3. Improving user satisfaction and app quality based on sentiment analysis of user reviews.**

**4. Analyzing the impact of app type (free or paid) on user sentiment and ratings.**

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px


### Dataset Loading

In [None]:
# Load Dataset

# read playstore applications data from csv

apps= pd.read_csv('/content/Play Store Data (3).csv')

### Dataset First View

In [None]:
# Dataset First Look

apps

In [None]:
apps.head()

In [None]:
apps.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

apps.shape

### Dataset Information

In [None]:
# Dataset Info

apps.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

apps.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

apps.isnull().any()

In [None]:
apps.isnull().sum()

In [None]:
# Visualizing the missing values

missing_values = apps.isnull().sum()
plt.figure(figsize=(12, 30))
sns.barplot(x=missing_values.index, y=missing_values.values)
plt.xticks(rotation=90)
plt.xlabel('Columns')
plt.ylabel('Missing Value Count')
plt.title('Missing Values in Apps Dataset')
plt.show()





### What did you know about your dataset?

The Rating column has 1474 missing values; Type and Content Rating have 1 each, Current Version has 8, and Android Version has 3.
We can drop the records with a missing Type or Content Rating because they are a negligible share of the dataframe, but we cannot drop the null values of Rating because that would discard 1474 records.
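
The drop-or-keep reasoning above can be sketched on a toy dataframe. This is a minimal illustration under assumed, made-up values, not the actual Play Store data: it expresses missing values as a percentage of rows and drops only the rows where a rarely missing column is null.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the apps data (values are made up for illustration).
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, np.nan, 3.9],
    "Type": ["Free", "Paid", None, "Free"],
})

# Share of missing values per column, as a percentage of all rows.
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)

# Drop rows only where the rarely missing column is null;
# rows with a missing Rating are kept.
cleaned = toy.dropna(subset=["Type"])
print(cleaned)
```

Looking at percentages rather than raw counts makes it easier to judge which columns are cheap to clean by dropping rows.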

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

apps.columns

In [None]:
# Dataset Describe

apps.describe(include='all')

### Variables Description

### 1. Play Store Data.csv
**App** : Categorical, the app name.

**Category** : Categorical, category the app belongs to.

**Rating** : Numerical, the rating the app has received from users, ranging from 0.0 to 5.0.

**Reviews** : Numerical, the number of reviews that the app received.

**Size** : Numerical, the size of the app. The suffix M means megabytes and k means kilobytes.

**Installs** : Numerical, describes the number of installs.

**Type** : Categorical, a label that indicates whether the app is free or paid.

**Price** : Numerical, the price value for the paid apps.

**Content Rating** : Categorical, a rating that indicates the target age group.

**Genres** : Categorical, list of genres to which the app belongs.

**Last Updated** : Date format, the date on which the app was last updated.

**Current Version** : Version of the app as specified by the developers.

**Android Version** : The Android OS version the app is compatible with.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

apps.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

# check whether the record with a missing Type contains useful information; if yes, do not drop it, otherwise drop it

apps[apps.Type.isna()]


In [None]:
apps.Type.value_counts()

In [None]:
# drop records with missing type and content rating

apps.dropna(subset=['Type', 'Content Rating'], axis=0, inplace=True)

In [None]:
apps.isnull().sum()

## 2. Remove Duplicate Records

Since this data is about Google Play Store applications, conflicting information for the same application may lead to ambiguity. We remove all duplicate records so we can perform the analysis on cleaner data.
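
As a side note, "duplicate" can mean two different things here. The sketch below uses made-up toy rows to contrast exact duplicate rows, which `drop_duplicates()` removes by default, with rows that merely share an app name. The keep-the-most-reviewed policy at the end is only an assumption for illustration, not what this notebook does.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta", "Beta"],
    "Reviews": [100, 100, 50, 75],
})

# Exact duplicates: every column matches (the default for duplicated/drop_duplicates).
exact = toy.duplicated().sum()

# Same app name, but possibly different values in the other columns.
same_name = toy.duplicated(subset="App").sum()

# One possible policy (an assumption): keep the row with the most reviews per app.
deduped = toy.sort_values("Reviews", ascending=False).drop_duplicates(subset="App")

print(exact, same_name)
print(deduped)
```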

In [None]:
apps.duplicated().any()

In [None]:
# drop duplicates and update the dataframe

apps.drop_duplicates(inplace=True)

In [None]:
# cross-check whether there are still any duplicates

apps.duplicated().any()

In [None]:
# check the number of missing values in Rating after removing duplicates

apps.isna().sum()

## 3. Correcting Data Types of each Attribute

An improper data type can make analysis harder or even incorrect. For example, if a datetime column is read as a string, you cannot easily extract the month, year, or day from it; similarly, if an integer or float column is read as a string, you cannot calculate its average or total. So, to simplify the analysis, we first correct the data types.
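
To make this concrete, here is a small sketch on toy values. The date format shown is an assumption for illustration. Once the columns have proper types, averages and date parts become one-liners.

```python
import pandas as pd

# Toy values for illustration; both columns start out as strings.
toy = pd.DataFrame({
    "Reviews": ["10", "30"],
    "Last Updated": ["January 7, 2018", "August 1, 2018"],
})

# Convert to proper types.
toy["Reviews"] = toy["Reviews"].astype(int)
toy["Last Updated"] = pd.to_datetime(toy["Last Updated"])

# Numeric and date operations are now available.
mean_reviews = toy["Reviews"].mean()
years = toy["Last Updated"].dt.year.tolist()
months = toy["Last Updated"].dt.month.tolist()
print(mean_reviews, years, months)
```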

In [None]:
# finding the data types of each attribute

apps.info()

In [None]:
apps[apps.Type == 'Paid'].head()

Here, we can see special characters in the Price and Installs columns. A string can be converted to a number only if it contains no stray symbols or letters, so we first have to remove these special characters.
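
A defensive variant of this cleanup is sketched below on toy values. The helper `strip_symbols` and the malformed entries are hypothetical and only serve the illustration. After stripping the symbols, `pd.to_numeric` with `errors="coerce"` turns anything that is still non-numeric into NaN instead of raising an error.

```python
import pandas as pd

# Toy values; "Free" and "Everyone" stand in for hypothetical malformed entries.
toy = pd.DataFrame({
    "Installs": ["10,000+", "500+", "Free"],
    "Price": ["$4.99", "0", "Everyone"],
})

def strip_symbols(series):
    # regex=False: treat '+', ',' and '$' as literal characters, not regex syntax
    for char in ["+", ",", "$"]:
        series = series.str.replace(char, "", regex=False)
    return series

# errors="coerce" converts values that are still non-numeric into NaN
toy["Installs"] = pd.to_numeric(strip_symbols(toy["Installs"]), errors="coerce")
toy["Price"] = pd.to_numeric(strip_symbols(toy["Price"]), errors="coerce")
print(toy)
```

Counting the resulting NaN values afterwards is a quick way to spot rows that need a manual look.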

In [None]:
# list of characters to remove
chars_to_remove = ['+', ',', '$']

# columns from where to remove special characters
cols_to_clean = ['Reviews', 'Installs', 'Price']

for col in cols_to_clean:
    for char in chars_to_remove:
        # regex=False: '+' and '$' are regex metacharacters, so match them literally
        apps[col] = apps[col].str.replace(char, '', regex=False)

apps[apps.Type == "Paid"]

The Content Rating column contains some verbose or ambiguous values, so we replace them with simpler ones.

In [None]:
apps['Content Rating'].unique()

In [None]:
# replace 'Everyone 10+' with '10+', 'Mature 17+' with 'Mature', and 'Adults only 18+' with 'Adults'

# assign the result back: replace(inplace=True) on a selected column may not update the dataframe
apps['Content Rating'] = apps['Content Rating'].replace(['Everyone 10+', 'Mature 17+', 'Adults only 18+'], ['10+', 'Mature', 'Adults'])

In [None]:
apps['Content Rating'].unique()

We have now cleaned the values of Price, Installs, Reviews (in case it contained any special characters), and Content Rating. The next step is to change their data types.
Content Rating, Type, and Category hold discrete values, so we convert them to the categorical type.
Price holds floating-point numbers, while Installs and Reviews hold integers.

In [None]:
apps['Content Rating'] = apps['Content Rating'].astype('category')

apps.Price = apps.Price.astype('float64')
apps.Installs = apps.Installs.astype('int')
apps.Reviews = apps.Reviews.astype('int')
apps.Type = apps.Type.astype('category')
apps.Category = apps.Category.astype('category')

In [None]:
# convert genres to category and last updated to datetime object as it contains dates

apps.Genres = apps.Genres.astype('category')
apps['Last Updated'] = pd.to_datetime(apps['Last Updated'])



In [None]:
# check the data type of each attribute

apps.info()

## 4. Select Interested/Informative Attributes Only

Analyzing attributes that are not needed adds noise, so it is good practice to slice the dataframe before starting EDA. Here, I am not interested in analyzing the current version or Android version of the applications, so I ignore these attributes and store all the others as apps2.

In [None]:
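# Pick out all rows and the columns from 'App' to 'Last Updated', i.e. positions 0 to 10

# .copy() makes apps2 an independent dataframe, so adding columns later does not raise SettingWithCopyWarning
apps2 = apps.iloc[:, :11].copy()
apps2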

### What all manipulations have you done and insights you found?

The dataset contains information about mobile applications from the Google Play Store.
There are missing values in the 'Rating' column, which cannot be dropped due to a significant number of missing records.
The dataset had duplicate records, which were removed to ensure data cleanliness.
Certain attributes had special characters or complicated values, which were cleaned and corrected.
Data types of attributes were adjusted to facilitate analysis.
The 'apps2' dataframe contains the selected attributes for further analysis.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 - Histogram - Highest Rated Apps (Top 10)

In [None]:
# Chart - 1 visualization code

top_rated = apps2.nlargest(10, 'Rating')


### Categories of Highly Rated Apps

In [None]:
sns.set_style('darkgrid')
plt.hist(top_rated['Genres'])
plt.xlabel('Category of 10 highly rated Apps')
plt.ylabel('Count')
plt.margins(0.10)
plt.show()

##### 1. Why did you pick the specific chart?

The histogram was chosen to represent the categories of highly rated apps because it shows the count of each category and makes those counts easy to compare. Here, it shows that the 'DATING' category has the highest count (6), followed by 'EVENTS' (3) and 'COMICS' (1), giving a quick view of how the highly rated apps are distributed across categories.

##### 2. What is/are the insight(s) found from the chart?

Of the 10 highest-rated applications, 6 belong to 'DATING', 3 to 'EVENTS', and only 1 to 'COMICS'.
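
One caveat when reading a top-10 by rating: `nlargest` breaks ties by row order, so apps with a perfect score from only a handful of reviews can fill the list. The sketch below uses made-up toy values, and the `min_reviews` threshold is a hypothetical choice. It shows how the ranking changes once a minimum review count is required.

```python
import pandas as pd

# Toy values for illustration only.
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [5.0, 5.0, 4.8, 4.7],
    "Reviews": [2, 3, 12000, 90000],
})

# Ties are broken by row order, so barely reviewed apps can dominate.
naive_top = toy.nlargest(2, "Rating")["App"].tolist()

# Hypothetical threshold: only rank apps with at least 1000 reviews.
min_reviews = 1000
robust_top = toy[toy["Reviews"] >= min_reviews].nlargest(2, "Rating")["App"].tolist()

print(naive_top, robust_top)
```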



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can potentially help create a positive business impact. By identifying the popular categories of highly rated apps and understanding user preferences and market demand, businesses can focus their resources on developing or improving apps in those categories, increasing their chances of success and growth. However, it's important to note that the absence of certain categories among the highly rated apps does not necessarily imply negative growth. Negative growth can occur due to various factors such as competition, user experience and satisfaction, and market trends. To ensure positive growth, businesses should continuously monitor user feedback, adapt to market dynamics, and provide a compelling user experience that meets or exceeds customer expectations.

#### Chart - 2 - Countplot -   Exploring Categories of applications

In [None]:
# Chart - 2 visualization code

#Number of categories
print("Total Categories: {}".format(len(apps2.Category.unique())))

plt.figure(figsize=[15, 6])
sns.set_context('talk')
sns.countplot(x='Category', data = apps2, order=apps2.Category.value_counts().index)
plt.xticks(rotation=90)

plt.show()

##### 1. Why did you pick the specific chart?

The countplot was chosen as an appropriate chart to represent the number of categories and their counts. It efficiently communicates the distribution of apps across different categories, highlighting the prevalence of certain categories like FAMILY, GAMES, TOOLS, BUSINESS, and MEDICAL, while also allowing for a quick understanding of the relative frequencies of other categories.

##### 2. What is/are the insight(s) found from the chart?

Most applications belong to the FAMILY category, followed by GAMES, and then TOOLS, BUSINESS, and MEDICAL. The remaining categories contain fewer apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can help create a positive business impact. By understanding the distribution of apps across different categories and identifying the most prevalent categories, businesses can make informed decisions regarding product development, marketing strategies, and resource allocation. They can focus their efforts on categories like FAMILY, GAMES, TOOLS, BUSINESS, and MEDICAL, which have a higher count and potentially a larger user base. However, negative growth can occur if businesses rely solely on these insights without considering other crucial factors, such as market saturation, intense competition within popular categories, or evolving user preferences. To ensure positive business impact, businesses should also focus on user satisfaction, market trends, and innovation to differentiate themselves and maintain a competitive edge.

#### Chart - 3 - Pie chart-   Impact of Price on app installations

In [None]:
# Chart - 3 visualization code

installs = apps2.groupby('Type')['Installs'].agg('sum')
installs

In [None]:
labels = installs.index
sizes = installs.values

# Create the pie chart
plt.pie(sizes, labels=labels, colors = ['green','yellow'] ,  autopct='%1.0f%%')
plt.title('App Installs by Type')
plt.show()

##### 1. Why did you pick the specific chart?


I selected a pie chart because it is a suitable choice when you want to visually represent the distribution or proportion of different categories or groups as part of a whole. In this case, the data provided included two categories ("Free" and "Paid") and the goal was to show the proportion of app installs for each category. A pie chart allows for easy comparison of the relative sizes of the categories and provides a clear visual understanding of the distribution.

##### 2. What is/are the insight(s) found from the chart?

People tend to install and use free applications; paid applications account for only a very small proportion of installs. Next, we can analyze the impact of price on app ratings.
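
The proportions shown in the pie chart can also be computed directly. The sketch below uses made-up toy install counts, so the resulting percentages are illustrative and not the dataset's actual figures.

```python
import pandas as pd

# Toy install counts for illustration only.
toy = pd.DataFrame({
    "Type": ["Free", "Free", "Paid"],
    "Installs": [900, 90, 10],
})

# Total installs per type, then each type's share of the overall total.
installs = toy.groupby("Type")["Installs"].sum()
share_pct = (installs / installs.sum() * 100).round(1)
print(share_pct)
```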

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from the pie chart can indeed help create a positive business impact by providing valuable information about the distribution of app installs. By understanding the proportion of installs for different categories, such as "Free" and "Paid," businesses can make informed decisions about resource allocation, pricing strategies, and revenue generation opportunities. For example, if the majority of installs are in the "Paid" category, it indicates a potential for direct revenue generation. However, if the insights reveal a very small proportion of installs in the "Paid" category or a significant dominance of free installs, it could indicate a potential negative impact on growth. This might suggest a need to reassess monetization strategies, pricing models, or the overall value proposition of the paid offerings to address market dynamics and drive positive growth.

#### Chart - 4- Histogram with Kernel Density Estimate (KDE) and Hue Differentiation - Distribution of App Ratings

In [None]:
# Chart - 4 visualization code

avg_rating = apps2.Rating.mean()
print("Average Rating: {}".format(avg_rating))

sns.displot(data=apps2, x='Rating', kde=True, hue='Type', height=8, aspect=1.2)
plt.axvline(avg_rating, linestyle='--', color = 'red', linewidth = 2.0)
plt.text(3.3, 800, 'Average Rating -->', color = 'red', fontsize=14)
plt.title('PDF or Distribution of Rating with Type', fontsize=18)
plt.show()

##### 1. Why did you pick the specific chart?

I chose the histogram with KDE and hue differentiation because it effectively displays the distribution of ratings for paid and unpaid applications in a visually appealing manner, allowing for easy comparison. The histogram provides a clear representation of the frequency of ratings, while the KDE curve smooths the distribution for better visibility. The hue differentiation helps identify any differences in the rating distributions between paid and unpaid apps, and the vertical line representing the average rating provides a useful reference point for comparison.

##### 2. What is/are the insight(s) found from the chart?

The mean rating is approximately 4.2. The rating distribution is negatively skewed: most ratings cluster at the high end, with a tail extending toward low values. Paid apps have far fewer ratings than free apps. To confirm the shape of the distribution, we can plot the ECDF.
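
Skewness can also be checked numerically. The sketch below uses made-up toy ratings. For a negatively skewed sample, `Series.skew()` is below zero, and the mean usually sits below the median because the low tail pulls it down.

```python
import pandas as pd

# Toy ratings for illustration: mostly high values with one low outlier.
ratings = pd.Series([1.0, 3.5, 4.0, 4.2, 4.3, 4.4, 4.5, 4.5, 4.6, 4.7])

# Sample skewness: negative values indicate a longer left tail.
skewness = ratings.skew()

# For left-skewed data the mean is typically pulled below the median.
mean_minus_median = ratings.mean() - ratings.median()
print(skewness, mean_minus_median)
```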

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from the chart can potentially help create a positive business impact. By examining the distribution of ratings for paid and unpaid applications, businesses can identify trends and patterns that may influence their decision-making processes. For example, if the chart reveals that unpaid applications tend to have higher ratings compared to paid applications, businesses can consider adjusting their pricing strategy or improving the quality of their paid applications to enhance customer satisfaction and potentially increase revenue. However, if the chart indicates that paid applications consistently receive lower ratings compared to free applications, it may suggest a negative growth potential. In such cases, businesses would need to analyze the reasons behind the lower ratings and take corrective actions to improve the user experience and enhance the value proposition of their paid applications in order to mitigate any negative impact.

#### Chart - 5- Empirical Cumulative Distribution Function (ECDF) plot

In [None]:
# Chart - 5 visualization code

# to confirm that the distribution of rating is negatively skewed,
# we use the empirical cumulative distribution function

plt.figure(figsize=[15, 6])
sns.ecdfplot(data = apps2, x='Rating', hue='Type', legend=True)
plt.title("ECDF of Rating", fontsize=16)
plt.show()

##### 1. Why did you pick the specific chart?


The ECDF plot was chosen because it provides a clear and intuitive representation of the cumulative distribution of ratings for both free and paid apps. By plotting the empirical cumulative probabilities against the rating values, we can easily visualize how the ratings are distributed across the range. The ECDF plot allows us to assess the skewness of the distribution and observe any differences between the distributions of free and paid apps.

##### 2. What is/are the insight(s) found from the chart?

The ratings of both free and paid apps are negatively skewed, as the tail extends to the left.
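
For readers unfamiliar with the ECDF, the following is a minimal sketch of what the plot computes. The helper name `ecdf` is our own. For each sorted value, it gives the fraction of observations less than or equal to it.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the fraction of observations <= each value."""
    x = np.asarray(values, dtype=float)
    x = np.sort(x[~np.isnan(x)])           # ignore missing ratings
    y = np.arange(1, len(x) + 1) / len(x)  # cumulative proportion
    return x, y

x, y = ecdf([4.5, 3.0, 4.0, 5.0])
print(x, y)
```

In a left-skewed distribution the curve stays low over the small values and rises steeply near the high end.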

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The gained insights from the ECDF plot can potentially help create a positive business impact. By analyzing the cumulative distribution of ratings for free and paid apps, businesses can gain a better understanding of customer satisfaction and identify opportunities for improvement. For example, if the ECDF plot shows that a significant portion of ratings for paid apps is concentrated at lower values compared to free apps, it may indicate a potential negative impact on revenue and growth. In such cases, businesses can investigate the reasons behind the lower ratings, address any issues affecting user experience, and implement strategies to enhance the value proposition of their paid apps. This proactive approach can help mitigate negative growth and lead to positive outcomes by improving customer satisfaction and increasing the likelihood of positive reviews and recommendations.

#### Chart - 6 - Combination Histogram and KDE plot - Apps rating with respect to size of application

In [None]:
# Chart - 6 visualization code

# work on a copy so the extra column is not added to apps2 itself
apps_with_size = apps2.copy()

apps_with_size['size_unit'] = ['KB' if x.endswith('k')
                               else 'MB' if x.endswith('M') else 'Varies with device'
                               for x in apps_with_size.Size]
apps_with_size.head()

In [None]:
sns.displot(data=apps_with_size, x='Rating', kde=True, hue='Type', height=4, aspect=1.5, col='size_unit')
plt.show()

##### 1. Why did you pick the specific chart?


I chose the combination of a histogram and KDE plot because it effectively visualizes the distribution of app ratings across different size units and distinguishes between free and paid apps. By utilizing the hue parameter and columns for size units, the chart provides a clear and comprehensive view of the rating distribution in relation to app size and pricing, allowing for easy comparison and analysis.

##### 2. What is/are the insight(s) found from the chart?

Large free applications (sized in megabytes) tend to have more ratings than small applications (sized in kilobytes).
Applications whose size varies with device have more ratings if they are free.
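
If a finer comparison than the size unit is wanted, the Size strings could be converted to a numeric value. The sketch below rests on assumptions: the helper `size_to_mb` is hypothetical, it treats 1024 kilobytes as one megabyte, and it expects the part before the suffix to be a plain number.

```python
import numpy as np
import pandas as pd

def size_to_mb(size):
    """Convert strings such as '19M' or '512k' to megabytes; anything else becomes NaN."""
    if isinstance(size, str):
        if size.endswith("M"):
            return float(size[:-1])
        if size.endswith("k"):
            return float(size[:-1]) / 1024  # assumption: 1024 kB per MB
    return np.nan

# Toy values for illustration only.
sizes = pd.Series(["19M", "512k", "Varies with device"])
size_mb = sizes.map(size_to_mb)
print(size_mb)
```

A numeric size column would allow a scatter plot or binned comparison of rating against size instead of only three groups.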

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The gained insights from the combination histogram and KDE plot with hue and columns can potentially help create a positive business impact. By analyzing the rating distribution in relation to app size and pricing, businesses can identify patterns and make informed decisions. For example, they can identify the most common rating ranges for different size units and pricing categories, allowing them to optimize their app development and pricing strategies to align with user preferences. However, one potential insight that could lead to negative growth is if the plot reveals a significant concentration of low ratings for a specific size unit or pricing category. This could indicate user dissatisfaction with apps of that particular size or pricing, prompting businesses to reevaluate their offerings and address any issues causing negative feedback in order to avoid negative growth or loss of customers.

#### Chart - 7 - Strip Plot - Category Vs Price

In [None]:
# Chart - 7 visualization code

#apps_with_price = apps2[apps2.Type == 'Paid']

apps_with_price = apps2.loc[(apps2.Type == 'Paid')]
apps_with_price.head()

In [None]:
plt.figure(figsize = [15, 10])
sns.set_context('paper')
sns.stripplot(y='Category', x='Price', data=apps_with_price,\
              jitter=True, size=6, palette='plasma', marker='D')
plt.axvline(apps_with_price.Price.mean(), linewidth=1.5, linestyle='-')
plt.text(20, 2, '<-- Average Price', fontsize=15, color='red')
plt.show()

##### 1. Why did you pick the specific chart?


I chose the strip plot with a horizontal orientation to visualize the relationship between app categories and their prices for paid apps. This chart allows for a clear comparison of prices across different categories and helps identify any potential trends or variations. By using markers to represent each data point and adding jitter to avoid overlapping, we can effectively observe the distribution of prices within each category. The addition of a vertical line indicating the average price and a text annotation further aids in understanding the overall pricing pattern.

##### 2. What is/are the insight(s) found from the chart?

It is hard to tell from this plot which types of applications generally have high prices. So, let's find the categories with high prices.
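
One way to complement the strip plot is a per-category price summary. The sketch below uses made-up toy prices. The median is shown alongside the maximum because a single extreme price can distort a category's average.

```python
import pandas as pd

# Toy paid-app prices for illustration only.
paid = pd.DataFrame({
    "Category": ["FINANCE", "FINANCE", "GAME", "GAME", "TOOLS"],
    "Price": [399.99, 4.99, 0.99, 2.99, 3.99],
})

# Count, median, and maximum price per category, highest median first.
summary = (
    paid.groupby("Category")["Price"]
    .agg(["count", "median", "max"])
    .sort_values("median", ascending=False)
)
print(summary)
```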

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights from the strip plot can potentially help create a positive business impact. By analyzing the relationship between app categories and their prices for paid apps, businesses can gain insights into pricing strategies and market trends. They can identify categories that command higher prices and have more profit potential. However, one potential insight that could lead to negative growth is if the plot reveals a lack of demand or low prices in certain categories. In our case, it is hard to tell from the plot which types of applications generally have high prices.

## Applications with High Prices

#### Chart - 8 - Scatter plot -  Applications with High Prices

In [None]:
# Chart - 8 visualization code

#high_prices = apps_with_price[apps_with_price.Price >= 250]

high_prices = apps_with_price.loc[(apps_with_price.Price >= 250)]
high_prices

These apps do not seem useful, as they share common keywords like I AM RICH; some developers may have made them just for fun or for some other purpose. So, it is now necessary to filter for important categories such as FINANCE, BUSINESS, HEALTH, SOCIAL, LIFESTYLE, TOOLS, GAMES, COMMUNICATION, and EDUCATION.

In [None]:
plt.figure(figsize = [15, 10])
sns.set_context('paper')


imp_cats = ['FINANCE', 'BUSINESS', 'HEALTH', 'SOCIAL', 'LIFESTYLE', 'TOOLS', 'GAME', 'COMMUNICATION', 'EDUCATION']

popular_apps = apps_with_price.loc[(apps_with_price.Category.isin(imp_cats))]

# Let's explore these application under price $100
popular_apps_under_100 = popular_apps.loc[(popular_apps.Price <= 100)]

print("Average Price: ${}".format(round(apps_with_price.Price.mean(), 2)))

sns.set()
plt.plot(popular_apps_under_100.Price, popular_apps_under_100.Category, marker='D', linestyle='none',\
        markersize = 8, color='goldenrod')
plt.axvline(apps_with_price.Price.mean(), linewidth=1.5, linestyle='-')
plt.text(16, 2, '<-- Average Price', fontsize=15, color='red')
plt.xlabel("Price (in dollars)", fontsize=14)
plt.ylabel("Popular Categories", fontsize=14)
plt.title("Prices of Popular categories under $100", fontsize=16)
plt.show()

We can say that FINANCE, LIFESTYLE, TOOLS, GAMES, and BUSINESS applications tend to be more expensive than others, as these categories have applications priced above the average. However, there is only one BUSINESS application with a price close to 100 dollars. We will now find that application and explore its reviews and rating to check whether its development is purposeful.

In [None]:
business_outlier = popular_apps_under_100[popular_apps_under_100.Price >= 85]
business_outlier

This application has no rating and only 6 reviews; its size is 10 MB, and it has only 10 installs. We will wrap up this analysis by exploring which types of applications receive more reviews, and then perform sentiment analysis on the reviews CSV of Play Store applications to find apps with positive reviews and ratings.

## **Sentiment Analysis Based on Reviews**

### We can draw conclusion from reviews of users

In [None]:
reviews_df = pd.read_csv('/content/User Reviews (3).csv')
reviews_df.head()

## Variables Description

### **2. User Reviews.csv**

App : the app name

Translated_Reviews : The reviews text in English.

Sentiment : the sentiment of the review, positive, neutral, or negative.

Sentiment_polarity : the sentiment in numerical form, ranging from -1.00 to 1.00

Sentiment_Subjectivity : a measure of the expression of opinions, evaluations, feelings, and speculations.

In [None]:
# check for null values

reviews_df.isna().sum()

In [None]:
# check duplicates

reviews_df.duplicated()

In [None]:
reviews_df.loc[(reviews_df.duplicated()==True)]

In [None]:
# drop duplicates

reviews_df.drop_duplicates(inplace=True)

In [None]:
reviews_df.shape

In [None]:
# after removing duplicate records we are left with a relatively small number of null values

reviews_df.isna().sum()

Merge previous application dataframe with this review dataframe to explore reviews of free, paid, and popular category of apps.
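
Before relying on a merge, it can help to audit what it keeps and drops. The sketch below uses made-up toy frames. The default inner join keeps only apps present in both frames, while `indicator=True` and `validate="m:1"` are optional checks on unmatched rows and on repeated app names in the apps table.

```python
import pandas as pd

# Toy frames for illustration only.
reviews = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Gamma"],
    "Sentiment": ["Positive", "Negative", "Neutral"],
})
apps_toy = pd.DataFrame({
    "App": ["Alpha", "Beta"],
    "Category": ["TOOLS", "GAME"],
})

# Default inner join: only apps present in both frames survive.
inner = reviews.merge(apps_toy, on="App")

# indicator=True adds a _merge column showing where each row came from;
# validate="m:1" raises an error if an app name repeats in apps_toy.
audit = reviews.merge(apps_toy, on="App", how="left", indicator=True, validate="m:1")
unmatched = (audit["_merge"] == "left_only").sum()
print(len(inner), unmatched)
```

If the apps table still contains several rows for the same app name, an unvalidated merge repeats each review once per matching row, which would inflate the sentiment counts.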

In [None]:
app_df = reviews_df.merge(apps2, on=['App'], suffixes=('_rev', '_org'))
app_df.head()

In [None]:
app_df.shape

In [None]:
# group applications based on their category and then sentiment to find out the proportion of
# positive, negative, and neutral reviews of popular categories

category_sent = app_df.groupby('Category', group_keys=False)['Sentiment'].value_counts(normalize=True).rename('proportion').reset_index()
category_sent

In [None]:
# from all above categories pick out records with popular categories only

sent_of_pop_cat = category_sent[category_sent.Category.isin(imp_cats)]
sent_of_pop_cat

#### Chart - 9 -  Grouped Bar chart Using PLOTLY to plot proportion of sentiment of each category

In [None]:
# Chart - 9 visualization code

fig = px.bar(sent_of_pop_cat, x="proportion", y="Category", color="Sentiment", barmode="group",
             color_discrete_sequence=px.colors.qualitative.Safe, hover_name="Sentiment",
             category_orders={ "Sentiment": ["Positive", "Negative", "Neutral"]})

# keep the grouped layout set by barmode="group" above; only the title is set here
fig.update_layout(title="Sentiment Proportion Along Category")
fig.update_xaxes(showgrid=True, ticks="outside", tickson="boundaries")

fig.show()

##### 1. Why did you pick the specific chart?


The specific chart, a grouped bar chart, was chosen because it effectively presents and compares the sentiment proportion across different categories of applications. The use of grouped bars allows for a clear visualization of the sentiment distribution within each category, while the color coding distinguishes between positive, negative, and neutral sentiments. This chart type is particularly useful when there is a need to compare multiple categories and their corresponding sentiment proportions in a concise and easily interpretable manner.

##### 2. What is/are the insight(s) found from the chart?

Education, communication, and lifestyle applications usually get a positive response, while games get the most negative comments compared to other applications.
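
A claim like this can also be read off a table instead of the chart. The following is a rough sketch, assuming a long-format frame shaped like `sent_of_pop_cat`; the frame `toy_category_sent` and its numbers are invented for illustration and are not results from the notebook.

```python
import pandas as pd

# Hypothetical long-format proportions, shaped like sent_of_pop_cat
toy_category_sent = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME",
                 "EDUCATION", "EDUCATION", "EDUCATION"],
    "Sentiment": ["Positive", "Negative", "Neutral"] * 2,
    "proportion": [0.55, 0.35, 0.10, 0.75, 0.15, 0.10],
})

# One row per category, one column per sentiment; a missing
# combination becomes 0.0 rather than NaN
wide = toy_category_sent.pivot(
    index="Category", columns="Sentiment", values="proportion"
).fillna(0.0)

# Categories ordered from most to least positive
ranked = wide.sort_values("Positive", ascending=False)

# Category with the largest share of negative reviews
most_negative = wide["Negative"].idxmax()

print(ranked)
print(most_negative)
```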

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.


The gained insights from analyzing the sentiment proportion across different application categories can potentially have a positive business impact. By identifying that education, communication, and lifestyle applications usually receive positive responses, businesses operating in these categories can focus on leveraging this positive sentiment to attract more users, improve customer satisfaction, and potentially increase user engagement and revenue.

On the other hand, the insight that games receive the most negative comments compared to other applications could potentially lead to negative growth for businesses in the gaming industry. Negative sentiment towards games may indicate issues with gameplay, user experience, or other factors that result in dissatisfaction among users.

### Also find out the effect of sentiments on type of application.

#### Chart - 10 - Stacked Bar Plot - Effect of sentiments on type of application.

In [None]:
# Chart - 10 visualization code

# group applications on the basis of their type (paid, free) and then sentiments

type_sent = app_df.groupby('Type')['Sentiment'].value_counts(normalize=True).rename('percent').reset_index()
type_sent

In [None]:
# use plotly to plot bar plots

fig = px.bar(type_sent, x="Type", y="percent", color="Sentiment",
             color_discrete_sequence=px.colors.qualitative.Antique, hover_name="Sentiment",
             category_orders={ "Sentiment": ["Positive", "Negative", "Neutral"]})

fig.update_layout(barmode='stack', title="Sentiment Proportion Along Type")
fig.update_xaxes(showgrid=True, ticks="outside", tickson="boundaries")

fig.show()

##### 1. Why did you pick the specific chart?

I picked the specific chart, a stacked bar plot, because it effectively visualizes the proportion of positive, negative, and neutral sentiments for both paid and free applications. By using different colors for each sentiment category, it allows for easy differentiation and comparison between the two types of apps.

##### 2. What is/are the insight(s) found from the chart?

Plotting the sentiment proportions of paid versus free apps shows that free apps receive a larger share of harsh comments, while reviews for paid apps appear never to be extremely negative. This may indicate something about app quality, i.e., paid apps being of higher quality than free apps on average, although the chart alone cannot confirm this. In this notebook, we analyzed over ten thousand apps from the Google Play Store, and we can use these findings to inform our decisions should we ever wish to create an app ourselves.
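
Proportions hide how many reviews stand behind each bar, so it can help to look at counts and polarity extremes next to the shares before drawing conclusions about paid versus free apps. The sketch below uses a hypothetical frame `toy_app_df` with made-up values, and the column name `Sentiment_Polarity` is an assumption about how the merged dataframe labels the polarity score.

```python
import pandas as pd

# Hypothetical stand-in for app_df; values are invented
toy_app_df = pd.DataFrame({
    "Type": ["Free"] * 6 + ["Paid"] * 2,
    "Sentiment": ["Positive", "Positive", "Positive", "Negative",
                  "Negative", "Neutral", "Positive", "Negative"],
    # Column name assumed; adjust to match the real dataframe
    "Sentiment_Polarity": [0.8, 0.5, 0.3, -0.9, -0.4, 0.0, 0.6, -0.2],
})

# Per app type: number of reviews, share of negative reviews,
# and the most negative polarity observed
summary = toy_app_df.groupby("Type").agg(
    n_reviews=("Sentiment", "size"),
    negative_share=("Sentiment", lambda s: (s == "Negative").mean()),
    min_polarity=("Sentiment_Polarity", "min"),
)

print(summary)
```

In this toy data the paid group has only two reviews, so its shares are far less reliable than those of the free group; the same check on the real data would show whether the paid-app bars rest on a comparable number of reviews.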

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The gained insights can potentially help create a positive business impact. The analysis shows that paid apps have a higher proportion of positive sentiments compared to free apps, indicating that users generally have a more positive experience with paid apps. This suggests that investing in creating high-quality paid apps can lead to positive user feedback and potentially drive revenue growth. On the other hand, the analysis also reveals that free apps receive a significant percentage of negative sentiments. This implies that free apps may have lower overall user satisfaction, which could negatively impact user retention and hinder monetization opportunities through ads or in-app purchases. It is important for businesses to consider these insights and find ways to address the negative sentiment patterns in free apps, such as improving user experience, addressing user feedback, or considering alternative monetization strategies to ensure sustainable growth.

## **5. Solution to Business Objective**

#### What do you suggest the client do to achieve the business objective?
Explain briefly.

Based on the analysis, it is suggested to focus on app categories that have high ratings and pricing. The combination histogram and KDE plot provide insights into the distribution of ratings based on app size and pricing categories. By identifying categories with high ratings and high prices, the client can prioritize app development in those categories to potentially maximize revenue and profitability.

It is important to consider the relationship between app size and ratings. The scatter plot analysis reveals that larger-sized apps tend to have higher ratings. Therefore, the client should focus on optimizing the app development process to ensure that larger-sized apps are of high quality and provide a positive user experience. This may involve efficient coding practices, minimizing app size without compromising functionality, and addressing any performance issues.

To enhance app quality and user satisfaction, it is crucial to analyze user sentiment and reviews. The sentiment analysis on user reviews can help identify positive and negative sentiments associated with different app categories. This information can be used to address any issues or concerns raised by users, improve app features, usability, and overall user experience.

We can also analyze the impact of app type (free or paid) on user sentiment and ratings. The stacked bar plot analysis reveals that paid apps generally receive more positive sentiments compared to free apps. This suggests that the client should consider offering paid apps, as they may have a higher likelihood of positive user feedback and better user satisfaction. However, it is essential to consider market dynamics, competition, and user preferences while making pricing decisions.

# **Conclusion**

In conclusion, the analysis of the Play Store app and review data has provided valuable insights for the client to achieve their business objectives. The findings suggest that optimizing app development practices for larger-sized apps can lead to higher ratings and improved user satisfaction. By prioritizing efficient coding practices and addressing performance issues, the client can enhance app quality and user experience. Additionally, the analysis highlighted specific app categories that command higher prices and receive higher ratings, suggesting that the client should focus on developing apps in these profitable categories to maximize revenue.

Furthermore, sentiment analysis of user reviews revealed the importance of addressing user feedback and incorporating user suggestions to improve app features, usability, and overall user experience. By actively responding to user sentiment and working towards enhancing app quality, the client can increase user satisfaction, improve ratings, and drive positive business impact. Moreover, considering the impact of app type (free or paid), the analysis suggests that offering paid apps can potentially lead to higher user satisfaction and positive user feedback.
