<a href="https://colab.research.google.com/github/bhasskararav/playstore_data_EDA/blob/main/Play_Store_App_Review_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Play Store App Review Analysis**



##### **Project Type**    - **_Exploratory Data Analysis_**
##### **Contribution**    - **_BATTIVILLI BHASKARARAO (Individual)_**

# **Project Summary -**

_The Play Store apps data is a goldmine of information for app developers. It can be used to gain insights into what features are most popular, what genres are most successful, and what factors contribute to app engagement and success._

The data can be analyzed in a variety of ways, but some of the most common methods include:

**Exploratory data analysis (EDA):** This involves looking at the data from different angles to get a sense of its overall structure and distribution, using simple statistical techniques such as plotting the distribution of app ratings or the number of downloads by genre.

**Data visualization:** This involves using charts and graphs to represent the data in a visually appealing way. This can help to identify trends and patterns that might not be obvious from the raw data.

Here are some specific questions that an app developer might be interested in answering:

* What are the most popular genres of apps?
* What are the most popular features in each genre?
* What are the factors that contribute to app engagement?
* What are the factors that contribute to app success?
* What are the trends in app development?

By answering these questions, app developers can gain a better understanding of the market and how to create apps that are more likely to be successful.

# **GitHub Link -**

https://github.com/bhasskararav

# **Problem Statement**


**App developers are constantly looking for ways to create more successful apps. However, it can be difficult to know what features to include and how to market their apps in order to achieve success.**

#### **Define Your Business Objective?**

 To gain insights into what features are most popular, what genres are most successful, and what factors contribute to app engagement and success. This information will be used to help app developers create more successful apps.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is answered for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np               # NumPy: numerical arrays and mathematical functions
import pandas as pd              # pandas: tabular data loading, cleaning, and analysis
import seaborn as sns            # seaborn: statistical data visualization
import matplotlib.pyplot as plt  # matplotlib: static, animated, and interactive plots


### Dataset Loading

In [None]:
# Load the two CSV files with pandas.
Raw_Data = pd.read_csv('Play Store Data.csv')
Raw_Reviews = pd.read_csv('User Reviews.csv')

# Work on copies so that the raw DataFrames stay untouched.
app_data = Raw_Data.copy()
app_reviews = Raw_Reviews.copy()


### Dataset First View

In [None]:
# to look at the data we can use head() function from pandas
app_data.head(10)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count, complete shape of the data
app_data.shape

### Dataset Information

In [None]:
# Dataset Info
# Knowing the column types and non-null counts first makes the later cleaning steps easier.
app_data.info()

**Basic understanding of the data**
It is important to know the data types of the variables in the dataset. Here the columns are stored as `float` or `object` (string) types.

The dataset contains 10841 rows, each describing an app found on the Play Store.

It has 13 columns, which are the parameters of the apps. Let's look at each column -

* App - Name of the app
* Category - category of the app
* Rating - average user rating, out of 5
* Reviews - number of reviews given by users
* Size - size of the app (in MB or kB)
* Installs - number of installations of the app
* Type - free or paid
* Price - price in $ of paid apps
* Content Rating - audience the app is rated for
* Genres - genre of the app
* Last Updated - date on which the app was last updated
* Current Ver - current version of the app
* Android Ver - Android version the app supports

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count. Checking for duplicates is an important first step in data cleaning.

duplicate_data = app_data.duplicated(keep='first')  # Boolean mask: True for each repeat of an earlier, identical row
print(duplicate_data)
number_of_duplicate_values = duplicate_data.sum()  # number of fully duplicated rows

print(number_of_duplicate_values)

#### Handling Duplicate Values

In [None]:
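# Drop duplicate apps, keeping the first row for each app name.
# Note: this deduplicates on 'App' only, so it also removes rows that
# share an app name but differ in other columns.
app_data.drop_duplicates(subset='App', inplace=True)

app_data.shape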
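
The duplicate count above looks at fully identical rows, while the drop step deduplicates on the app name only, so the two numbers need not match. A minimal sketch of the difference, using a small made-up DataFrame (the `toy` data is hypothetical, not taken from the Play Store file):

```python
import pandas as pd

# Hypothetical toy data: the same app can appear twice with different review counts
toy = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta", "Beta"],
    "Reviews": [100, 120, 50, 50],
})

# Rows that repeat an earlier row in every column
full_row_dupes = toy.duplicated(keep="first").sum()

# Rows that repeat an earlier app name, whatever the other columns contain
name_dupes = toy.duplicated(subset="App", keep="first").sum()

print(full_row_dupes, name_dupes)

# Deduplicating on the name keeps the first row seen for each app
deduped = toy.drop_duplicates(subset="App", keep="first")
print(deduped)
```

Keeping the first occurrence is a simple choice; sorting by review count before dropping would be one way to keep the most recent snapshot instead.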

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
# Find the columns that contain at least one null value
null_columns = app_data.columns[app_data.isnull().any()]
# Count the number of null values in each of those columns
app_data[null_columns].isnull().sum()


In [None]:
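# Visualizing the missing values
# Bar chart of the number of null values in each column of app_data
plt.bar(app_data.columns, app_data.isnull().sum())
plt.xticks(rotation=90)  # rotate the column names so they do not overlap
plt.ylabel('Number of null values')
plt.show()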
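
Raw null counts are easier to judge as a share of the rows. The sketch below computes the percentage of missing values per column on a small made-up DataFrame (the `toy` data and the name `missing_pct` are illustrative assumptions); the same expression could be applied to `app_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few gaps
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Paid", None, "Free"],
})

# isnull().mean() gives the fraction of nulls per column; scale to percent
missing_pct = toy.isnull().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that actually have missing values
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```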

### What did you know about your dataset?

The dataset has **10841 rows and 13 columns**, and it contains duplicate values.
There are 1463 null values in the _Rating_ column, 1 each in _Type_ and _Content Rating_, 8 in _Current Ver_, and 3 in _Android Ver_. After dropping duplicate apps based on the _App_ column, **9660 rows and 13 columns** remain.


## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

print(app_data.columns)

**Columns present in data**
* App
* Category
* Rating
* Reviews
* Size
* Installs
* Type
* Price
* Content Rating
* Genres
* Last Updated
* Current Ver
* Android Ver

In [None]:
# Dataset Describe
app_data.describe()

### Variables Description

**App:** Name of the application.

**Category:** Category of the application.

**Rating:** Average rating of the application, given by users, out of 5.

**Reviews:** Number of reviews given by users for the application.

**Size:** Size of the application to install.

**Installs:** Total installations of the application.

**Type:** Whether the app is Paid or Free.

**Price:** Price of the app, if it is a paid app.

**Content Rating:** Audience or age group the content is intended for.

**Genres:** Genre of the application.

**Last Updated:** Date on which the app was last updated.

**Current Ver:** Current (latest available) version of the application.

**Android Ver:** Android version(s) the application is compatible with.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in app_data.columns:
    unique_values = app_data[column].unique()
    print(column, unique_values)  # print the column name next to its values



In [None]:
# Check Unique Values for each variable.
print(app_data['Category'].unique())

print(app_data['Size'].unique())



## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Checking for the Null Values
app_data[null_columns].isnull().sum()

The output above shows missing data in Rating, Type, Content Rating, Current Ver, and Android Ver.

Let's handle the missing values, starting with **Type**.

In [None]:
# Checking missing values in Type column

app_data[app_data['Type'].isna()]

There is one missing value in the Type column. The app's Installs value is zero, so the app may not be available at all. Its Price is also zero, so if it is available it must be free. We can therefore fill the missing Type with Free.

In [None]:
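# Handling the missing value in Type
# Assign the result back instead of calling fillna(inplace=True) on a column selection
app_data['Type'] = app_data['Type'].fillna('Free')
app_data[app_data['Type'].isna()]  # should now return an empty DataFrame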

Now check with the **Content Rating** column

In [None]:
# Checking missing values in Content Rating column
app_data[app_data['Content Rating'].isna()]

This row is completely mismatched. Its Rating is 19, although users can only rate out of 5, so it would also act as an outlier. The values in the other columns do not fit their columns either, so it is better to drop this row from the data.

In [None]:
# Dropping the row with the missing Content Rating, because its data is mismatched.
app_data.dropna(subset=['Content Rating'], inplace= True)
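
The mismatched row was found through its missing Content Rating, but a range check on Rating would catch the same kind of problem directly. A minimal sketch on made-up data, assuming valid ratings lie between 0 and 5 (the `toy` rows and variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one valid rating, one missing, one impossible value
toy = pd.DataFrame({
    "App": ["Good app", "Unrated app", "Shifted row"],
    "Rating": [4.5, np.nan, 19.0],
})

is_rated = toy["Rating"].notna()        # ignore missing ratings here
in_range = toy["Rating"].between(0, 5)  # assumed valid range, inclusive

# Rows that have a rating, but one that cannot be real
out_of_range = toy[is_rated & ~in_range]
print(out_of_range)
```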

That leaves two columns with missing values: **Current Ver and Android Ver**. These version fields change with every release and are not important features for this analysis, so we drop both columns instead of handling their missing values.

In [None]:
# Dropping the Current Ver and Android Ver columns.

app_data.drop(['Android Ver','Current Ver'], axis= 1, inplace= True)

The **Rating** column has 1463 missing values. Ratings are very important for developers who want to improve their apps, so let's look at the distribution and summary statistics of the column before deciding how to fill the gaps.

In [None]:
# Plot the distribution of Rating to check the skewness of the data.
# histplot with kde=True replaces the deprecated sns.distplot.
sns.histplot(app_data['Rating'], kde=True)
plt.show()

The plot shows that the data is left-skewed, and more than 50% of the ratings are above 4. Let's check the statistics to confirm.

In [None]:
#Stats of the Rating column
app_data['Rating'].describe()

In [None]:
app_data['Rating'].median()

The plot and the statistics show that 75% of the ratings are above 4. Because the distribution is skewed, the median is a more robust fill value than the mean, so we replace the missing ratings with the median.

In [None]:
# Replacing the missing values with the median of the ratings
median_value = app_data['Rating'].median()  # compute the median and store it
# Assign the result back instead of calling fillna(inplace=True) on a column selection
app_data['Rating'] = app_data['Rating'].fillna(median_value)
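
To see why the median suits a left-skewed column, the sketch below compares the mean and the median on a few made-up ratings (the values are hypothetical, chosen so that one low rating drags the mean down):

```python
import pandas as pd

# Hypothetical ratings: mostly high, one very low, two missing
ratings = pd.Series([1.0, 4.2, 4.4, 4.5, 4.6, None, None])

mean_value = ratings.mean()      # pulled down by the single low rating
median_value = ratings.median()  # stays with the bulk of the ratings

print(mean_value, median_value)

# Fill the gaps with the median, as done for app_data above
filled = ratings.fillna(median_value)
print(filled.tolist())
```

Both statistics skip missing values by default, so the fill value is computed from the observed ratings only.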

The missing values have now been handled.
Let's check the data again.

In [None]:
# Checking the data after handling the missing values
app_data.isna().sum()

Great! There are no missing values left in the data.
That completes the cleaning part of the data wrangling. Let's check whether any columns still need to be transformed.

In [None]:
# After cleaning the data basic info
app_data.info()

Now we need to fix the data types. The **Reviews and Size** columns hold numerical data but are stored as object, so we convert them to int or float.

In [None]:
# Changing the data type of the Reviews column.
app_data['Reviews'] = app_data.Reviews.astype(int)

In [None]:
#Let's check once the unique values of the Size column.
app_data['Size'].unique()

Next is the Size column. Some sizes are given in kB and others in MB, so we need a single unit. We remove the M suffix and convert kB to MB by replacing k with e-3 (for example, `201k` becomes `201e-3`, which is 0.201 MB).

* Note that some apps list their size as Varies with device. We cannot predict a value for these, so it is better to drop them.

In [None]:
# Converting Size to a single unit (MB) and a numeric type
# Drop the 'Varies with device' rows first; copy() avoids a SettingWithCopyWarning below
app_data = app_data[app_data['Size'] != 'Varies with device'].copy()
app_data['Size'] = app_data['Size'].str.replace('M', '', regex=False)     # remove the M suffix
app_data['Size'] = app_data['Size'].str.replace('k', 'e-3', regex=False)  # kB to MB: 'k' becomes 'e-3'

# Changing data type from object to float.
app_data['Size'] = app_data['Size'].astype(float)
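
An alternative to string replacement is a small parsing function that handles each suffix explicitly. This is only a sketch: `size_to_mb` is a hypothetical helper, and unlike the cell above it keeps unparseable values such as Varies with device as NaN instead of dropping the rows.

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Convert strings like '19M' or '201k' to megabytes; anything else becomes NaN."""
    if isinstance(value, str):
        if value.endswith("M"):
            return float(value[:-1])
        if value.endswith("k"):
            return float(value[:-1]) / 1000  # same 1 kB = 1e-3 MB convention as above
    return np.nan

# Hypothetical sample values in the formats described above
sizes = pd.Series(["19M", "201k", "Varies with device"])
size_mb = sizes.map(size_to_mb)
print(size_mb.tolist())
```

Keeping NaN would let the affected apps stay in the analyses that do not use Size, at the cost of handling missing sizes later.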

Next is the **Price** column. We first remove the $ symbol from the values, then convert the column to float.

In [None]:
# Removing the $ symbol from the data (string values only).
app_data['Price'] = app_data['Price'].apply(lambda x: x.replace('$', '') if isinstance(x, str) else x)


# Converting the data type from object to float
app_data['Price'] = app_data['Price'].astype(float)

Now we convert the **Installs** column. Here we need to remove the **+** and **,** symbols from the values.

In [None]:
# Remove the '+' and ',' symbols from the 'Installs' column (string values only)
app_data['Installs'] = app_data['Installs'].apply(lambda x: x.replace('+', '').replace(',', '') if isinstance(x, str) else x)

# Now, convert the 'Installs' column to the int data type
app_data['Installs'] = app_data['Installs'].astype(int)
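
`astype(int)` raises an error if any value is still not numeric after the symbols are removed. A more forgiving sketch uses `pd.to_numeric` with `errors="coerce"`, which turns such values into NaN so they can be inspected. The sample values below are made up, including the stray text entry.

```python
import pandas as pd

# Hypothetical sample values; the last one stands in for a stray non-numeric entry
installs = pd.Series(["10,000+", "500+", "Free"])

# Remove '+' and ',' in one vectorized pass
cleaned = installs.str.replace(r"[+,]", "", regex=True)

# Non-numeric leftovers become NaN instead of raising an error
installs_num = pd.to_numeric(cleaned, errors="coerce")
print(installs_num.tolist())

# Rows that failed to convert can then be reviewed or dropped
bad_rows = installs[installs_num.isna()]
print(bad_rows.tolist())
```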


* The **Last Updated** column contains the date on which the app was last updated. It is stored as object, so we convert it to a datetime type.

In [None]:
# Converting the data type of Last Updated
from datetime import datetime

def get_date(date_string):
    """Parse a 'Month day, year' string into a pandas Timestamp."""
    date_obj = datetime.strptime(date_string, '%B %d, %Y').date()
    return pd.to_datetime(date_obj)

app_data['Last Updated'] = app_data['Last Updated'].apply(get_date)
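
The same conversion can be done without a row-by-row function, because `pd.to_datetime` accepts a whole column together with a format string. A minimal sketch on two made-up dates written in the `'%B %d, %Y'` format used above:

```python
import pandas as pd

# Hypothetical sample dates in 'Month day, year' form
dates = pd.Series(["January 7, 2018", "August 1, 2018"])

# Vectorized parsing with an explicit format
parsed = pd.to_datetime(dates, format="%B %d, %Y")
print(parsed.dtype)

# Date parts are then available through the .dt accessor
print(parsed.dt.year.tolist(), parsed.dt.month.tolist())
```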

* The data cleaning is almost done, so let's look at the whole dataset again.

In [None]:
# Rows were dropped above, so reset the index to keep it contiguous.
app_data.reset_index(drop=True, inplace=True)


In [None]:
# Checking the head of the data
app_data.head()

In [None]:
# Checking the tail of the data
app_data.tail()

In [None]:
# info about the data
app_data.info()

In [None]:
# Descriptive analysis stats
app_data.describe()

# **App Reviews file exploring**

In [None]:
# App reviews data shape view
app_reviews.shape

In [None]:
# Apps review data view
app_reviews.head(10)

In [None]:
# checking the data types
app_reviews.dtypes

In [None]:
#checking for the null values
app_reviews.isnull().sum()

In [None]:
# Merging the app data with the reviews on the shared 'App' column (inner join by default)
Merged_data = pd.merge(app_data, app_reviews, on='App')
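
An inner join silently drops apps without reviews and reviews without a matching app. The sketch below, on made-up data, shows how `validate` and `indicator` can make the join behaviour visible; the `apps` and `reviews` frames are hypothetical stand-ins for `app_data` and `app_reviews`.

```python
import pandas as pd

# Hypothetical stand-ins for the two tables
apps = pd.DataFrame({
    "App": ["Alpha", "Beta", "Gamma"],
    "Category": ["GAME", "TOOLS", "FAMILY"],
})
reviews = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Delta"],
    "Sentiment": ["Positive", "Negative", "Neutral"],
})

# validate raises an error if 'App' is not unique on the left side
merged = pd.merge(apps, reviews, on="App", how="inner", validate="one_to_many")
print(len(merged))

# An outer join with indicator=True shows where each row came from
outer = pd.merge(apps, reviews, on="App", how="outer", indicator=True)
counts = outer["_merge"].value_counts()
print(counts)
```

Here only Alpha survives the inner join; Beta and Gamma have no reviews, and Delta has no app record.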

In [None]:
# View the data after merge
Merged_data.head()

In [None]:
# checking for the null values
Merged_data.isnull().sum()

In [None]:
# Dropping the rows with null values in Sentiment or Translated_Review
Merged_data = Merged_data.dropna(subset = ['Sentiment', 'Translated_Review'])

In [None]:
Merged_data.describe()

### What all manipulations have you done and insights you found?

* First, we checked for duplicate values and removed them. After that, the data shape was (9660, 13).
* Then we checked for missing/null values.
  There were 1463 in Rating, 1 each in Type and Content Rating, 8 in Current Ver, and 3 in Android Ver.
    * Type = Filled with Free, because the price is zero.
    * Content Rating = Dropped the row (its data was mismatched).
    * Rating = Filled with the median (75% of the ratings are above 4).
    * Current Ver and Android Ver = Dropped both columns.
* Reviews column: converted from object to integer.
* Size column: removed M, converted k to MB, dropped the Varies with device rows, and converted to float.
* Installs column: removed + and the comma, and converted to integer.
* Price column: removed $, and converted to float.
* Last Updated column: converted from object to datetime by parsing the date format.
* Merged the app data and the reviews data.
* Dropped the null values from the merged data.


## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
# Pie chart to check which Category has the most installs.

# Grouping the data by 'Category' and summing 'Installs'
category_installs = app_data.groupby('Category')['Installs'].sum()

# Plotting the pie chart
plt.figure(figsize=(12, 12))  # Optional: Set the figure size
plt.pie(category_installs, labels=category_installs.index, autopct='%1.1f%%')
plt.title('App Installs by Category')

plt.axis('equal')  # Equal aspect ratio ensures the pie chart is circular.
plt.show()

##### 1. Why did you pick the specific chart?

* I picked this chart to visualize which app categories account for most of the installs.

##### 2. What is/are the insight(s) found from the chart?

* The Game category has the largest share of installs, at 36.7%.
* The second highest is the Family category, at 11.3%.
* The third highest is the Tools category, at 9.2%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Based on this observation, users install game applications more often than apps from any other category. A likely explanation is that users rely on one or a few popular social or utility apps, but install several games and switch to new ones when they get bored with the current selection. Game apps can therefore gain users quickly, provided they are interesting and appealing, while other categories may not see the same level of repeated installation.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Bar chart of the total installs per Category, sorted from highest to lowest.

# Grouping the data by 'Category' and summing 'Installs'
category_installs = app_data.groupby('Category')['Installs'].sum()

# Sorting the categories based on the number of installs in descending order
category_installs = category_installs.sort_values(ascending=False)

# Plotting the bar chart
plt.figure(figsize=(10, 6))  # Optional: Set the figure size
plt.bar(category_installs.index, category_installs)
plt.xlabel('App Category')
plt.ylabel('Number of Installs')
plt.title('Number of Installs by App Category')
plt.xticks(rotation=90)  # Optional: Rotate the category labels for better readability
plt.tight_layout()  # Optional: Adjust the spacing to prevent label cutoff
plt.show()


##### 1. Why did you pick the specific chart?

*  The pie chart above showed which categories have the largest share of installs. This bar chart shows the number of installs for each category.

##### 2. What is/are the insight(s) found from the chart?

* The Game category has the highest number of installs, with 1.15+M.
* The second highest category is Family, with 0.39+M installs.
* The third highest category is Tools, with 0.35+M installs.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Game apps have the advantage of gaining users rapidly, provided they are
  interesting and appealing. However, it is hard to keep those users active.
  

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# creating dataframe to get count of apps per Category and Content Rating
df = app_data.groupby(['Category','Content Rating']).count().reset_index()
df = df[['Category','Content Rating','App']]
df1 = df.pivot(columns='Content Rating', index='Category', values='App')

# Ploting bar chart
color_lst = ['gold', 'dodgerblue', 'deeppink','purple','red','magenta']
df1.plot(kind='bar', stacked= True, color= color_lst, figsize= (14,7))
plt.ylabel("Count")
plt.title("Categories and their targeted age groups")
plt.show()
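
The group, count, and pivot steps above can also be written as a single cross-tabulation. A minimal sketch with `pd.crosstab` on made-up rows (the `toy` data is hypothetical); note that it fills missing combinations with 0 rather than NaN.

```python
import pandas as pd

# Hypothetical toy rows: one app per row
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "GAME"],
    "Content Rating": ["Teen", "Everyone", "Everyone", "Teen"],
})

# Rows are categories, columns are content ratings, cells are app counts
table = pd.crosstab(toy["Category"], toy["Content Rating"])
print(table)

# The table can be plotted the same way as df1 above, for example:
# table.plot(kind="bar", stacked=True)
```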

##### 1. Why did you pick the specific chart?

* I chose this stacked bar chart to see which age groups each category mainly targets.

##### 2. What is/are the insight(s) found from the chart?

* The Family category has the highest number of applications, and most of
  their content is rated for Everyone.
* The second highest category is Game, which targets Teens more than the other categories do.
* The Tools category targets almost exclusively Everyone.
* The remaining categories cover nearly all age groups, except Auto and Vehicles,
  Libraries and Demo, and Parenting, which target Everyone.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The plot suggests that the Game category, where developers target Teens the most,
  is also the category with the most installs.
* The rest of the applications mostly target Everyone.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Grouping the data by 'Category'
grp = app_data.groupby('Category')

# Per-category aggregates: mean installs, total price, and mean reviews
x = grp['Installs'].mean()
y = grp['Price'].sum()
z = grp['Reviews'].mean()
# Plotting the data
plt.figure(figsize=(16, 5))
plt.plot(x.index, y, 'b--')  # blue dashed line; the colour is set once, in the format string
plt.xticks(rotation=90)
plt.title('Category Vs Pricing')
plt.xlabel('Categories')
plt.ylabel('Total price ($)')
plt.show()
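
The three per-category aggregates can also be computed in one call with named aggregation, which gives a single table with descriptive column names. A minimal sketch on made-up data (the `toy` values and the output column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS"],
    "Installs": [100, 300, 50],
    "Price": [0.0, 2.0, 1.0],
    "Reviews": [10, 30, 5],
})

# One row per category, one named column per aggregate
summary = toy.groupby("Category").agg(
    mean_installs=("Installs", "mean"),
    total_price=("Price", "sum"),
    mean_reviews=("Reviews", "mean"),
)
print(summary)
```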

##### 1. Why did you pick the specific chart?

* I chose this chart to compare the total price of apps across categories.

##### 2. What is/are the insight(s) found from the chart?

* The Family, Finance, Lifestyle, and Medical categories have a higher total
  price of paid applications than the other categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The Finance and Lifestyle categories have fewer applications and fewer
  installs, but higher pricing. That may be one reason they do not have more users.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Plotting the data
plt.figure(figsize=(16, 5))
plt.plot(x.index, z, 'b--')  # blue dashed line; z holds the mean reviews per category
plt.xticks(rotation=90)
plt.title('Category Vs Reviews')
plt.xlabel('Categories')
plt.ylabel('Reviews')
plt.show()

##### 1. Why did you pick the specific chart?

* I chose this plot to check how many reviews users contribute, on average,
  in each category.

##### 2. What is/are the insight(s) found from the chart?

* Users of the Game category respond strongly, and creative and communication
  applications also receive a large number of reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Reviews play a major role in attracting new users, which fits with the Game
  category having both many reviews and many installs.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Plotting the chart to see the data of paid apps Vs free app

free_apps = app_data[app_data['Type'] == 'Free']['Installs']
paid_apps = app_data[app_data['Type'] == 'Paid']['Installs']

plt.figure(figsize=(6, 6))
plt.bar(['Free Apps', 'Paid Apps'], [free_apps.sum(), paid_apps.sum()])
plt.title('Total Installs for Free Apps Vs. Paid Apps')
plt.xlabel('App Type')
plt.ylabel('Total Installs')
plt.show()


##### 1. Why did you pick the specific chart?

* I chose this plot to compare the install counts of free and paid apps.

##### 2. What is/are the insight(s) found from the chart?

* As expected, free apps have far more installs than paid apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Free applications reach more users. Developers who want revenue can show
  ads instead of charging users up front.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
category_counts = app_data['Category'].value_counts()

plt.figure(figsize=(12, 6))
plt.bar(category_counts.index, category_counts.values)
plt.xticks(rotation=90)
plt.xlabel('Category')
plt.ylabel('Number of Apps')
plt.title('Number of Apps in Each Category')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this plot to check how many applications there are in each category.

##### 2. What is/are the insight(s) found from the chart?

* The Family, Game, and Tools categories contain the most applications.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*

#### Chart - 8

In [None]:
# Chart - 8 visualization code
content_rating_counts = app_data['Content Rating'].value_counts()

plt.figure(figsize=(8, 8))  # a pie chart has no x-axis ticks, so no xticks call is needed
plt.pie(content_rating_counts, labels=content_rating_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Content Ratings')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this plot to check the distribution of content ratings.

##### 2. What is/are the insight(s) found from the chart?

* 82% of the applications are rated Everyone.
* 10.6% are rated Teen.
* 4.0% are rated Mature 17+.
* 3.2% are rated Everyone 10+.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The large majority of applications are rated Everyone. Only a small
  percentage target the other groups.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 6))
plt.plot(app_data['Rating'], app_data['Reviews'], 'o')
plt.xlabel('Rating')
plt.ylabel('Reviews')
plt.title('Rating vs. Reviews')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this plot to check the relation between Ratings and Reviews.

##### 2. What is/are the insight(s) found from the chart?

* The plot shows that highly rated applications also tend to have a large number of reviews.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* Apps with good ratings also tend to have more reviews, which can help
  them attract more users.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(10, 6))
plt.scatter(app_data['Rating'], app_data['Installs'], alpha=0.5)
plt.xlabel('Rating')
plt.ylabel('Installs')
plt.title('Rating vs. Installs')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this plot to show the relation between Ratings and Installs.

##### 2. What is/are the insight(s) found from the chart?

* The applications with the highest install counts are found among the highly rated applications.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* A good rating is likely to help installs, although the chart shows an
  association rather than cause and effect.

#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(8, 6))
plt.boxplot(app_data['Price'])
plt.ylabel('Price')
plt.title('Distribution of App Prices')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this plot to check the price distribution.

##### 2. What is/are the insight(s) found from the chart?

* The most expensive applications cost about $400.

* Most applications are priced below $50.

* Very few applications are priced between $150 and $350.



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The price of an application plays an important role in its installs.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(10, 6))
plt.hist(app_data['Size'], bins=20)
plt.xlabel('Size')
plt.ylabel('Number of Apps')
plt.title('Distribution of App Sizes')
plt.show()

##### 1. Why did you pick the specific chart?

I chose this plot to check the app size distribution.

##### 2. What is/are the insight(s) found from the chart?

* Most of the applications are below 50 MB in size.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* App size plays an important role in attracting users. It should be
  neither too large nor too small.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Count apps per Category and Type ('Free' / 'Paid'), with one column per Type
type_category_counts = app_data.groupby(['Category', 'Type']).size().unstack()

# Let pandas create the figure; a separate plt.figure() call would leave an empty figure behind
type_category_counts.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.xlabel('Category')
plt.ylabel('Number of Apps')
plt.title('Number of Free and Paid Apps in Each Category')
plt.legend(title='Type', loc='upper right', labels=['Free', 'Paid'])
plt.xticks(rotation=90)
plt.show()


##### 1. Why did you pick the specific chart?

I chose this plot to check how many free and paid applications each category has.

##### 2. What is/are the insight(s) found from the chart?

* Most categories contain both free and paid applications.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

* The large majority of applications are free, and the free applications
  are also the ones with the most users.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

numerical_columns = ['Rating', 'Reviews', 'Size', 'Installs', 'Price']
correlation_matrix = app_data[numerical_columns].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap of App Data')
plt.show()


##### 1. Why did you pick the specific chart?

To check the correlation between each pair of numerical columns.

##### 2. What is/are the insight(s) found from the chart?

* The strongest positive correlation is between Reviews and Installs (0.64), which means that apps with more reviews tend to have more installs. This makes sense because more reviews indicate more popularity and user feedback.
* The most negative value is between Rating and Price (-0.02). This is so close to zero that it indicates essentially no linear relationship, so the chart does not show that cheaper apps are rated higher.
* The correlation between Size and Installs is about 0, so there is no linear relationship between the size of an app and how many times it is installed. This suggests that users do not care much about size when they download an app, although a correlation alone cannot establish that.
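
The heatmap uses the default Pearson correlation, which measures linear association and is sensitive to extreme values, and columns such as Installs and Reviews have a few very large values. As a complementary check, a rank-based Spearman correlation could be computed alongside it. The sketch below uses made-up numbers purely to illustrate the difference; it makes no claim about the Play Store data.

```python
import pandas as pd

# Hypothetical data: y always grows with x, but far from linearly
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [1, 2, 4, 8, 16, 1000],
})

# Pearson measures linear association and is dominated by the extreme value
pearson = toy.corr(method="pearson").loc["x", "y"]

# Spearman works on ranks, so any strictly increasing relationship scores 1
spearman = toy.corr(method="spearman").loc["x", "y"]

print(round(pearson, 2), round(spearman, 2))
```

For `app_data`, the same `corr(method="spearman")` call on the numerical columns would give a second matrix to compare against the heatmap.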

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
numerical_columns = ['Rating', 'Reviews', 'Size', 'Installs', 'Price']
sns.pairplot(app_data[numerical_columns])
plt.suptitle('Pair Plot of App Data', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

To identify patterns and trends across the numerical columns.

##### 2. What is/are the insight(s) found from the chart?

- There is a **positive correlation** between the number of reviews and the rating of the app. This means that apps with more reviews tend to have higher ratings. This could indicate that users are more likely to rate an app if they are satisfied with it, or that apps with higher ratings attract more users and reviews.
- There is a slight **negative relationship** between the price and the rating of the app, although it is very weak (the heatmap reports a correlation of -0.02). If the effect is real, it could indicate that users expect more quality and features from expensive apps, or that cheaper apps are more appealing and accessible to a larger audience.
- There is a **weak correlation** between the size and the rating of the app. This means that the size of the app does not have a significant impact on its rating. This could indicate that users are more concerned about the functionality and usability of the app rather than its size, or that app developers optimize their apps to reduce their size without compromising their quality.
- There is a **cluster** of apps with low price, low size, high rating, and high number of reviews. This could indicate that these apps are very popular and successful in the market, as they offer good value for money, take up less space on the device, provide high satisfaction to the users, and generate a lot of feedback.
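
Visual impressions from a pair plot can be cross-checked with a rank-based (Spearman) correlation, which is less sensitive to a few very large values than the Pearson coefficient. This is a swapped-in technique that the notebook itself does not use. The sketch below computes it as the Pearson correlation of the ranks, and the `sample` values and the helper name `spearman_matrix` are hypothetical.

```python
import pandas as pd


def spearman_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Spearman correlation, computed as the Pearson correlation of average ranks."""
    return df.rank().corr()


# Hypothetical values; the last app is far larger than the rest.
sample = pd.DataFrame({
    "Reviews": [10, 100, 1_000, 10_000, 1_000_000],
    "Installs": [500, 1_000, 50_000, 100_000, 10_000_000],
    "Price": [4.99, 2.99, 0.99, 0.0, 0.0],
})

pearson = sample.corr()
spearman = spearman_matrix(sample)
print(pearson.round(2))
print(spearman.round(2))
```

If the two matrices disagree noticeably on the real data, the relationship is probably monotonic but not linear, or it is driven by a handful of extreme apps.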

## **5. Solution to Business Objective**

* **Focus on Quality**: Prioritize app quality and user experience to maintain high ratings and positive reviews. Regularly update and optimize the app based on user feedback.

* **Understand User Preferences**: Analyze user preferences for app categories, content ratings, and genres to develop targeted marketing strategies.

* **Pricing Optimization**: If offering paid apps, ensure that the price aligns with the perceived value and competitive landscape. Consider offering trial versions or in-app purchases to increase user engagement.

* **Optimize App Size**: Keep the app size reasonable to minimize barriers to installation. Consider optimizing app assets and resources to reduce file size.

* **Monitor Compatibility**: Stay updated with the latest Android versions and ensure app compatibility to reach a broader user base.

* **Regular Updates**: Consistently update the app to introduce new features, fix bugs, and keep the app relevant to users' evolving needs.

* **Market Research**: Continuously monitor the app market and competitors to identify opportunities and stay ahead in the competitive landscape.

# **Conclusion**


Based on the analysis of the app_data, several key insights and observations can be drawn. Here are some of the main conclusions:

* **App Categories:** The data shows the distribution of apps across various categories. Some categories have a significantly higher number of apps than others. Further investigation into the popular categories can help understand user preferences and market trends.

* **Ratings and Reviews:** The average app rating provides an insight into user satisfaction. Apps with higher ratings tend to have more positive reviews. It is essential for app developers to focus on improving app quality to maintain higher ratings and positive user feedback.

* **Pricing Strategy:** The data shows a mix of free and paid apps. Paid apps need to provide added value to users to justify the cost. Developers can analyze user preferences for free vs. paid apps to optimize their pricing strategy.

* **Content Rating and Genres:** Understanding the distribution of content ratings and genres can help identify the target audience for each app category. Tailoring app content to the appropriate audience can improve user engagement and satisfaction.

* **App Size and Compatibility:** The size of the app can influence user decisions to install. Analyzing app size in relation to installs and ratings can provide insights into user preferences and device compatibility.

* **Time-Based Analysis:** Tracking app updates over time helps identify trends in maintenance and improvement. Regular updates can positively impact user ratings and app performance.

* **Correlations:** Analyzing correlations between numerical variables can reveal relationships and patterns in the data. For example, there might be a positive correlation between app installs and reviews, indicating that more popular apps tend to receive more reviews.
