<a href="https://colab.research.google.com/github/Vikgta/PlayStore-Exploratory-Data-Analysis/blob/main/PlayStore_EDA_Submission_Vikash_Gupta.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    -



##### **Project Type**    - EDA
##### **Contribution**    - Individual



# **Project Summary -**

* With 2.5 billion active users, Android stands as the most popular operating system, commanding about 85% of the mobile device market. Google's OS dominates the landscape, and its largest app repository, the Google Play Store, remains a cornerstone for Android users.

* The extensive data housed in the Play Store offers significant potential to propel app-centric businesses to success. Developers can extract actionable insights from this treasure trove of information, enabling them to tailor their strategies and conquer the expansive Android market.

* The primary objective of our project was to meticulously gather and analyze comprehensive data on Google Play Store apps. The aim was to furnish insights into app features and present an accurate snapshot of the current state of the Android app ecosystem.

* The project's goal was to explore and scrutinize data, identifying pivotal factors influencing app engagement. The objective was to recommend optimal features that would pave the way for app success in the dynamic Android landscape.

* Our exploration delved into understanding relationships between various attributes, such as the app's pricing model (free or paid), user reviews, and overall ratings. The dataset provided encompasses details like Category, Review, Rating, Size, etc., spanning across 10,841 rows and 13 columns.

# **GitHub Link -**

[Link to GitHub](https://github.com/Vikgta/PlayStore-Exploratory-Data-Analysis)

# **Problem Statement**


Android, currently holding a substantial 74% of the market share, is a rapidly expanding operating system, reflecting its widespread adoption among a large portion of the population. Our objective is to assist Android developers in understanding the key drivers behind app downloads. We aim to uncover the factors influencing users' decisions to download an app, focusing on categories, reviews, pricing, ratings, and installations. By analyzing the interrelationships among these variables, we aim to provide valuable insights.

To address these goals, we have formulated several problem statements that will guide our analysis of the given dataset:

1. Category Installations:

   * Determine which app category boasts the highest number of installations.

2. Top 5 Most Installed Apps:

   * Identify and list the top 5 apps with the maximum number of installations.

3. Top 5 Apps with Low Installations:

   * Identify and list the top 5 apps with the lowest installation numbers, highlighting areas for improvement.

4. Importance of Ratings:

   * Explore and analyze the significance of ratings for applications. Understand how user ratings impact an app's success.

5. Top Categories on the Play Store:

   * Identify the top categories that dominate the Google Play Store in terms of the number of apps and installations.

6. Free vs. Paid Apps:

   * Determine the distribution of apps in terms of being free or paid. Understand the user preference for free or paid applications.

By addressing these problem statements, we aim to provide actionable insights for Android developers, enabling them to make informed decisions regarding app development, marketing, and user engagement strategies.

#### **Define Your Business Objective?**

* The objective of my analysis is to provide insights about Android applications and their categories.
* Dive deep into the data to find the factors that influence an application, and to understand why and how certain applications succeed while others do not.
* Find the key factors responsible for app engagement.
* Study the detailed information available for each app and analyse it.
* Find which attributes matter most for an application.
* Determine what an application needs in order to top the charts.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Vizualization in  a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
import missingno as msno   #library to identify the missing values/numbers

#setting font size throughout the notebook
plt.rcParams.update({'font.size': 14})

# render plots inline so they are saved with the cell output
%matplotlib inline

### Dataset Loading

In [None]:
# Load Dataset
# Mounting drive
from google.colab import drive
drive.mount('/content/drive')


# reading data file
dir_path = '/content/drive/MyDrive/Colab Notebooks/'
file_name = 'Play Store Data.csv'
playstore_file_path = dir_path + file_name
df = pd.read_csv(playstore_file_path)

### Dataset First View

In [None]:
# Dataset First Look
df.head(5)

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape
#we see that the data has 10841 rows and 13 columns

### Dataset Information

In [None]:
# Dataset Info
df.info()

We see that every column except Rating is of object type, so columns such as Reviews, Size, Installs and Price need to be converted to a suitable numeric type before analysis.
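
A quick way to see why a column was read as text is to try a numeric conversion and look at what fails. The sketch below uses a small made-up frame (the values only mimic the Play Store formats, they are not taken from the dataset) and `pd.to_numeric` with `errors="coerce"`.

```python
import pandas as pd

# Tiny illustrative frame; the values mimic the Play Store formats but are made up.
sample = pd.DataFrame({
    "Reviews": ["159", "967", "3.0M"],
    "Installs": ["10,000+", "500,000+", "Free"],
})

# Both columns are read as text because of the extra symbols.
print(sample.dtypes)

# errors="coerce" turns anything that is not a plain number into NaN,
# which makes the problematic entries easy to spot.
reviews_numeric = pd.to_numeric(sample["Reviews"], errors="coerce")
bad_reviews = sample.loc[reviews_numeric.isna(), "Reviews"].tolist()
print(bad_reviews)
```

Entries listed in `bad_reviews` are the ones that need cleaning before the column can be used as a number.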


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
dup = df.duplicated().sum()
print(dup)

The dataset contains 483 fully duplicated rows.
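
Note that `df.duplicated()` only flags rows that are identical in every column. Counting repeated app names is a different question and gives a different number. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up rows: the first two are identical, the third repeats only the app name.
sample = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Alpha", "Beta"],
    "Reviews": [10, 10, 25, 7],
})

full_row_dups = int(sample.duplicated().sum())            # rows identical in every column
name_dups = int(sample.duplicated(subset=["App"]).sum())  # rows repeating an app name
print(full_row_dups, name_dups)
```

This is why the full-row duplicate count and the duplicate count on the App column differ in this notebook.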

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missing_values = df.isna().sum()
print(missing_values)

In [None]:
# Visualizing the missing values
msno.matrix(df)

### What did you know about your dataset?

We see that **every column except Rating is of object type,** so columns such as Reviews, Size, Installs and Price need to be converted to a suitable numeric type.

**The following are the counts of null values.**

*   Rating: 1474
*   Type: 1
*   Content Rating: 1
*   Current Ver: 8
*   Android Ver: 3

**The following columns are listed in the dataframe**

1. App: the name of the app (this column contains duplicate values)
2. Category: the category the app belongs to
3. Rating: the rating received by the app
4. Reviews: the number of reviews received by the app
5. Size: the size of the app
6. Installs: the number of installs of the app
7. Type: whether the app is free or paid
8. Price: the price of the app, 0 if free and the amount if paid
9. Content Rating: the audience (age group) the app is targeted at
10. Genres: the genre of the app (closely related to Category)
11. Last Updated: the date on which the app was last updated
12. Current Ver: the latest version of the app
13. Android Ver: the Android version required to run the app

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
print('data Describe')
print('-'*30)
print(df.describe())
print('-'*30)
print('data info')
print('-'*30)
print(df.info())

### Variables Description

The following columns are listed in the dataframe:

1. App: the name of the app (this column contains duplicate values)
2. Category: the category the app belongs to
3. Rating: the rating received by the app
4. Reviews: the number of reviews received by the app
5. Size: the size of the app
6. Installs: the number of installs of the app
7. Type: whether the app is free or paid
8. Price: the price of the app, 0 if free and the amount if paid
9. Content Rating: the audience (age group) the app is targeted at
10. Genres: the genre of the app (closely related to Category)
11. Last Updated: the date on which the app was last updated
12. Current Ver: the latest version of the app
13. Android Ver: the Android version required to run the app

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df.columns:
  res = df[column].nunique(dropna=False)   # NaN is counted as a value, as with len(unique())
  print(f"The number of unique values in the {column} column is: {res}")

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# One helper that converts the text values in Price, Installs, Size and Reviews to numbers
def to_number(text):
  '''Parse a cleaned numeric string: float if it has a decimal point, else int.'''
  return float(text) if '.' in text else int(text)

def convert_int(x):
  '''Strip ',', '+', 'M', 'k' and '$' from a string and return it as a number.

  'M' and 'k' suffixes are expanded to 10**6 and 10**3. Misplaced text values
  ('Free', 'Varies with device', 'Everyone') are returned as 0.
  int()/float() are used instead of eval(), which would execute arbitrary text.
  '''
  if x.isdigit():           # already a plain integer string, convert directly
    return int(x)
  x = x.replace(',', '').replace('+', '')   # thousands separators and the '+' in Installs
  if 'M' in x:              # occurs in Size, e.g. '19M' -> 19 * 10**6
    return to_number(x.replace('M', '')) * 10**6
  if 'k' in x:              # occurs in Size, e.g. '201k' -> 201 * 10**3
    return to_number(x.replace('k', '')) * 10**3
  x = x.replace('$', '')    # '$' occurs in the Price column
  if x in ('Free', 'Varies with device', 'Everyone'):   # misplaced or non-numeric entries
    return 0
  return to_number(x)



In [None]:
df['Reviews'] = df['Reviews'].apply(convert_int)
df['Size'] = df['Size'].apply(convert_int)
df['Installs'] = df['Installs'].apply(convert_int)
df['Price'] = df['Price'].apply(convert_int)
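
As an alternative to applying a Python function row by row, the same kind of cleaning can be done with pandas string methods. The sketch below is only an illustration on a made-up `installs` Series, not the method used in this notebook. Unlike `convert_int`, it turns non-numeric entries into NaN rather than 0, which keeps them visible as missing.

```python
import pandas as pd

# Made-up values in the format of the Installs column.
installs = pd.Series(["10,000+", "500+", "0", "Free"])

# Strip thousands separators and the trailing "+", then convert.
# Entries that are still not numeric (here "Free") become NaN instead of raising.
cleaned = pd.to_numeric(
    installs.str.replace(r"[+,]", "", regex=True),
    errors="coerce",
)
print(cleaned.tolist())
```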

### What all manipulations have you done and insights you found?

After analysing the apps data, the following manipulations were made:

In [None]:
# One Rating value is 19, which is outside the valid range and is probably a data-entry error. It may distort the statistics in later steps, so we set it to 1.9.
df.loc[df['Rating'] == 19, 'Rating'] = 1.9
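
Rather than looking for one known bad value, we can flag every rating outside the valid range. The sketch below assumes ratings run from 1 to 5 and uses a made-up `ratings` Series.

```python
import pandas as pd

# Made-up ratings, including one out-of-range value and one missing value.
ratings = pd.Series([4.1, 3.9, 19.0, None, 5.0], name="Rating")

# Flag values outside 1 to 5; missing values are left for the imputation step.
out_of_range = ratings.notna() & ~ratings.between(1, 5)
print(ratings[out_of_range])
```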

In [None]:
x = df.Rating.unique()
x.sort()
print(x)

In [None]:
# We see that some values in category has underscores in it, so we are replacing the underscore with space
def remove_underscores(x):
  if '_' in x:
    x = x.replace('_', ' ')
  return x

In [None]:
# applying 'remove_underscores' function on category column

df['Category'] = df['Category'].apply(remove_underscores)

In [None]:
# finding null values
df.isna().sum()

We see that the following columns have null values.

*   Rating: 1474
*   Type: 1
*   Content Rating: 1
*   Current Ver: 8
*   Android Ver: 3


Since Rating has many null values, we replace them with the median instead of dropping the rows.

In [None]:
# Plotting the distribution helps us choose between the mean and the median.
plt.title('Distribution plot of Rating Feature')

sns.histplot(df['Rating'],kde = True)

In [None]:
# The distribution is skewed, so we fill the null values with the median, which is less sensitive to extreme values than the mean
median_value_for_ratings = df['Rating'].median()
median_value_for_ratings = round(median_value_for_ratings, 2)
df['Rating'] = df['Rating'].fillna(median_value_for_ratings)
print(f'the null values are updated as {median_value_for_ratings}')
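
To see why the median is the safer fill value for a skewed column, compare it with the mean on a few made-up ratings where one app scores very low:

```python
from statistics import mean, median

# Made-up, left-skewed ratings: most apps score high, one scores very low.
ratings = [4.5, 4.4, 4.3, 4.2, 1.0]

rating_mean = mean(ratings)      # pulled down by the single low value
rating_median = median(ratings)  # stays at the typical rating
print(round(rating_mean, 2), rating_median)
```

The single low value drags the mean well below what most apps score, while the median stays at a typical rating.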

In [None]:
# the remaining columns have very few null values, so we drop those rows
df = df.dropna(subset = ['Type', 'Content Rating', 'Current Ver', 'Android Ver'])

In [None]:
# Dataset Duplicate Value Count
dup = df['App'].duplicated().sum()
print(dup)

We see that there are 1181 duplicated app names in the App column, so we keep the first occurrence and drop the remaining rows.

In [None]:
# there are duplicate app names in the App column, so we drop the duplicates and keep the first occurrence
df = df.drop_duplicates(subset=['App'],keep="first")
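
Keeping the first occurrence is simple, but which copy is "first" depends on the row order of the file. One possible alternative, shown here on made-up data, is to keep the copy with the most reviews, on the assumption (not verified in this notebook) that it is the most recent snapshot of the app.

```python
import pandas as pd

# Made-up data: "Alpha" appears twice with different review counts.
sample = pd.DataFrame({
    "App": ["Alpha", "Alpha", "Beta"],
    "Reviews": [100, 250, 40],
})

# Sort so the most-reviewed copy of each app comes first, then keep that one.
deduplicated = (
    sample.sort_values("Reviews", ascending=False)
    .drop_duplicates(subset=["App"], keep="first")
    .sort_index()
)
print(deduplicated)
```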

## Summary of data Manipulation

* Rating had 1474 null values, which is 13.60% of the data, so the null values were replaced by the median, i.e. 4.3.
* Rows with the following null values were dropped:
>* Type had 1 null value, which is 0.01% of the data.
>* Content Rating had 1 null value, which is 0.01% of the data.
>* Current Ver had 8 null values, which is 0.07% of the data.
>* Android Ver had 3 null values, which is 0.03% of the data.

The following actions were taken:
1. Corrected the out-of-range value in the Rating column.
2. Removed underscores from the Category column.
3. Converted the data types from object to int or float.
4. Handled the missing values.
5. Fixed structural errors (misplaced text values in numeric columns were set to 0).
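
The percentages quoted in this summary follow directly from the null counts and the row count reported earlier in the notebook. A short check in plain Python:

```python
TOTAL_ROWS = 10841  # row count reported by df.shape earlier in the notebook

# Null counts as reported in this notebook.
null_counts = {
    "Rating": 1474,
    "Type": 1,
    "Content Rating": 1,
    "Current Ver": 8,
    "Android Ver": 3,
}

# Share of missing values per column, as a percentage rounded to two decimals.
null_share = {col: round(n / TOTAL_ROWS * 100, 2) for col, n in null_counts.items()}
for col, pct in null_share.items():
    print(f"{col}: {pct:.2f}%")
```

On the real dataframe, `df.isna().mean() * 100` gives the same shares without typing the counts by hand.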


### After data cleaning and manipulation, we inspect the data again to check that the discrepancies are gone

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.head()

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

### Chart - 1
Comparison of paid apps and free apps


In [None]:
# Chart - 1 visualization code
# application type distribution
#Here we use pie plot
fig, ax = plt.subplots(figsize=(5, 5), subplot_kw=dict(aspect="equal"))
number_of_apps = df['Type'].value_counts()
labels = number_of_apps.index
sizes = number_of_apps.values

# plotting the same
ax.pie(sizes, labeldistance=2,autopct='%1.1f%%')
ax.legend(labels=labels,loc="right",bbox_to_anchor=(0.9, 0, 0.5, 1))
ax.axis("equal")
plt.title('Type Distribution')
# plt.show()


##### 1. Why did you pick the specific chart?

A pie chart shows the split of apps by type (free or paid) very clearly.

##### 2. What is/are the insight(s) found from the chart?

We see that 7.8% of the apps are paid and the rest are free.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The large majority of published apps are free to use, and only a small share are paid.

### Chart - 2  
Lowest rating for the apps with respect to categories

In [None]:
# Chart - 2 visualization code
# lowest rating with respect to category
category_min_ratings = df.groupby('Category')['Rating'].min().sort_values(ascending = False)

# plotting
plt.rcParams['figure.figsize'] = (20, 5)
plt.bar(category_min_ratings.index, category_min_ratings.values)
plt.title('Minimum ratings of apps by Category')
plt.xlabel('Category')
plt.ylabel('Rating')
plt.xticks(rotation=90)

plt.show()

##### 1. Why did you pick the specific chart?

We are plotting the minimum rating for each category, so a bar chart is well suited to show which categories have the lowest and the highest minimum ratings.

##### 2. What is/are the insight(s) found from the chart?

The chart shows, for each category, the rating of its worst-rated app. Categories whose minimum is very low contain poorly rated apps, and these are the categories where a clearly better app could stand out.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Many categories contain apps with very low ratings, so there is a chance to bring a better app to the market in those categories.

### Chart - 3
Free Vs Paid apps with respect to categories

In [None]:
# Chart - 3 visualization code
category_price_counts = df.groupby(['Category', 'Type']).size().unstack()

#calculation/logic
total_counts = category_price_counts.sum(axis=1)
category_price_ratios = category_price_counts.div(total_counts, axis=0) * 100

# create a stacked bar chart
category_price_ratios.plot(kind='bar', stacked=True)
plt.title('Percentage of Free and Paid Apps by Category')
plt.xlabel('Category')
plt.ylabel('Percentage')
plt.legend(title='Price Type')
plt.show()
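
The percentage table behind a chart like this can also be built in one step with `pd.crosstab`. This is an alternative sketch on made-up data, not the code used for the chart above.

```python
import pandas as pd

# Made-up apps: three free and one paid game, one free and one paid medical app.
sample = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "GAME", "MEDICAL", "MEDICAL"],
    "Type": ["Free", "Free", "Free", "Paid", "Free", "Paid"],
})

# normalize="index" divides each row by its own total, giving per-category shares.
type_share = pd.crosstab(sample["Category"], sample["Type"], normalize="index") * 100
print(type_share)
```

Calling `type_share.plot(kind='bar', stacked=True)` on the result would draw the same kind of stacked bar chart.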


##### 1. Why did you pick the specific chart?

A stacked bar plot shows the percentage for all app categories by type(free and paid)

##### 2. What is/are the insight(s) found from the chart?

We see that the percentage of paid apps is highest in the Personalization and Medical categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The plot shows the percentage split between free and paid apps for each category. Personalization and Medical have a larger share of paid apps than the other categories.

### Chart - 4
Number of Apps Per Category count plot and Pie plot(pie code is commented)

In [None]:
# Chart - 4 visualization code
# Get the number of apps for each category using Count Plot

sns.set_style('darkgrid')
plt.figure(figsize=(20, 5))

sns.countplot(x='Category', data=df)

plt.title('Number of Apps Per Category', fontsize = 25)
plt.xticks(rotation=90)
plt.ylabel('Number of Apps')


####Percentage of apps belonging to each category in the playstore using Pie plot
# plt.figure(figsize=(5,5))

# plt.pie(df.Category.value_counts(), labels=df.Category.value_counts().index, autopct='%1.2f%%')
# my_circle = plt.Circle( (0,0), 0.050, color='white')

# plt.title('% of apps share in each Category', fontsize = 25)

##### 1. Why did you pick the specific chart?

We plotted two types of chart for a better understanding: a seaborn count plot and a pie chart (the commented code), which both show the share of apps across the various categories.

##### 2. What is/are the insight(s) found from the chart?

The Family category contains 18.5% of the apps, followed by Game with 9.94% and Tools with 8.55%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

There are many apps in the Family, Game and Tools categories. This plot helps us understand how crowded the category is in which we plan to release a new app.

### Chart - 5
Total app installs in each category

In [None]:
#  Chart - 5 visualization code
Max_Installs = df.groupby(['Category'])['Installs'].sum().sort_values()

#plotting
Max_Installs.plot.barh(figsize=(20,10), color = 'c')
plt.ylabel('App Categories', fontsize = 15)
plt.xlabel('Total app installs', fontsize = 15)   # the values are sums, not means
plt.title('Total app installs in each category', fontsize = 20)
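
Total and mean installs answer different questions: the total favours categories with many apps, while the mean shows how a typical app in the category performs. A sketch on made-up data that computes both with a named aggregation:

```python
import pandas as pd

# Made-up data: two large games, three smaller tools.
sample = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "TOOLS"],
    "Installs": [1_000_000, 3_000_000, 500_000, 500_000, 500_000],
})

# Named aggregation computes both statistics in one pass.
installs_by_category = sample.groupby("Category")["Installs"].agg(
    total_installs="sum",
    mean_installs="mean",
)
print(installs_by_category)
```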

##### 1. Why did you pick the specific chart?

A horizontal bar plot gives a clear picture of the number of installs in each category.

##### 2. What is/are the insight(s) found from the chart?

We see that the Game category has the most installs, followed by Communication and Tools.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

People are most inclined toward Game and Communication apps.

### Chart - 6
Content rating of the apps for various group of people

In [None]:
# Chart - 6 visualization code
# Content rating of the apps for different age groups
data = df['Content Rating'].value_counts()
labels = data.index   # take the labels from the counts so every slice is named correctly

#create pie chart
plt.figure(figsize=(6,6))
explode = (0,0.1,0.1,0.1,0.0,1.3)[:len(data)]   # one offset per slice
color = ['C4', 'r', 'c', 'g', 'm', 'k'][:len(data)]
plt.pie(data, labels = labels, colors = color, autopct='%.2f%%',explode=explode,textprops={'fontsize': 8})
plt.title('Content Rating')
# plt.legend()

##### 1. Why did you pick the specific chart?

Pie charts are well suited to showing how the data is distributed across a few groups.

##### 2. What is/are the insight(s) found from the chart?

We found that most of the apps carry the 'Everyone' content rating, i.e. they are suitable for all age groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The correlation between app installs and reviews is around 0.61, which is a moderately strong positive relationship.
Targeting the 'Everyone' content rating keeps the potential audience, and with it the potential installs and reviews, as wide as possible.

### Chart - 7
Paid app counts with respect to price (excluding free apps)

In [None]:
# Chart - 7 visualization code
#the plot shows the paid app counts with respect to price excluding free apps

free_app_value = 0
plt.rcParams['figure.figsize'] = (20, 5)
is_paid_app = df['Price'] != free_app_value   # boolean mask that is True for paid apps
sns.countplot(data=df[is_paid_app], x='Price')

plt.xticks(rotation=90)
plt.title('price wise Paid app counts')
plt.xlabel('Price')
plt.ylabel('Count')
plt.show()

##### 1. Why did you pick the specific chart?

A seaborn count plot shows how many apps fall at each value. Here we wanted the number of apps at each price other than 0, so we used the chart above.

##### 2. What is/are the insight(s) found from the chart?

We see that the most common price is 0.99, followed by 2.99.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Many paid apps are released at a price of 0.99 or 2.99.


Pricing a new app within this range is a sensible practice, as it keeps the app affordable and in line with the market.

### Chart - 8
Ratings wrt Category (Box plot)

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(15, 5))

#Plot Code
sns.boxplot(data=df, x="Category", y="Rating")

#Describing details for the plots
plt.title('Ratings with respect to various categories')
plt.xlabel('Category')
plt.ylabel('Rating')
plt.xticks(rotation=90)   # rotate after plotting so the category labels are affected


##### 1. Why did you pick the specific chart?

Box plots provide a concise summary of measures such as the median, quartiles, and outliers.

This makes it easy to compare multiple groups quickly and identify any differences or similarities.

##### 2. What is/are the insight(s) found from the chart?

For Communication and Events the median line does not sit inside the box but coincides with one of its edges, which indicates that the ratings in these categories are concentrated around a single value.

There are many outliers in various categories.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The outliers below the boxes show that several apps in those categories have poor ratings, so there is room to place an app with better features in the categories that have many low outliers.

### Chart - 9
Distribution of ratings across various apps

In [None]:
# Chart - 9 visualization code
plt.rcParams['figure.figsize'] = (20, 5)
sns.histplot(df, x='Rating', kde=True)

#Describing details for the plots
plt.title('Spread of various ratings across all the apps')
plt.xlabel('Rating')
plt.ylabel('Count')



##### 1. Why did you pick the specific chart?

Histograms are well suited to analysing the distribution of numeric values.

##### 2. What is/are the insight(s) found from the chart?

We see that most of the apps have a rating of around 4.2 to 4.3. Part of this peak comes from the 1474 missing ratings that were filled with the median value of 4.3.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Since most of the apps are rated around 4.2, an app has to score above that to stand out, which is an opportunity for a better app that fits the market's needs.

### Chart - 10
Sentiment analysis

In [None]:
# Chart - 10 visualization code

#importing the user Reviews CSV as user_df
user_df = pd.read_csv(dir_path + 'User Reviews.csv')

# grouping and plotting sentiments
sentiment_group = user_df.groupby('Sentiment')
s = sentiment_group['Sentiment'].count()
#setting plot size
plt.figure(figsize=(5,5))

# plotting pie
plt.pie(s, labels = s.index,  startangle = 90,autopct='%1.2f%%')

#Describing details for the plots
plt.title('Share of various sentiments')
plt.show()

##### 1. Why did you pick the specific chart?

Pie charts are well suited to showing the proportions of a small number of values.

##### 2. What is/are the insight(s) found from the chart?

We see that most of the reviews carry a positive sentiment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We observe that most of the translated reviews are positive.

### Chart - 11
Listing the top 10 apps by number of translated reviews received

In [None]:
# Chart - 11 visualization code
merge_df = pd.merge(df, user_df, how='inner', on='App')
# count the non-null translated reviews for each app (and its category)
App_Cat_df = merge_df.groupby(['App','Category'])['Translated_Review'].count().reset_index()
App_Cat_df = App_Cat_df.sort_values(by=['Translated_Review'], ascending=False)
App_Cat_df.head(10)

##### 1. Why did you pick the specific chart?

Here we list the apps that received the most translated reviews.

##### 2. What is/are the insight(s) found from the chart?

Most of the apps with the highest number of translated reviews belong to the Game category.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The Game category has the largest number of translated reviews.

### Chart - 12
Count plot of reviewed apps per category

In [None]:
# Chart - 12 visualization
plt.figure(figsize=(20, 5))

#plotting: App_Cat_df has one row per app, so this counts the reviewed apps in each category
sns.countplot(data = App_Cat_df, x = 'Category')

# Setting attributes to infer the plot
plt.title('Number of reviewed apps by category')
plt.ylabel('Number of apps')
plt.xlabel('Category')
plt.xticks(rotation=90)
plt.show()
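
A count plot counts rows, so on a frame with one row per app it shows the number of apps in each category rather than the number of reviews. If the total number of translated reviews per category is what we want, a sketch like the following could be used. The `app_reviews` frame here is a hypothetical stand-in with made-up numbers.

```python
import pandas as pd

# Hypothetical stand-in for the per-app frame: one row per app with its review count.
app_reviews = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Category": ["GAME", "GAME", "FAMILY", "WEATHER"],
    "Translated_Review": [120, 80, 150, 10],
})

# What a count plot shows: the number of rows (apps) per category.
apps_per_category = app_reviews["Category"].value_counts()

# Total number of translated reviews per category.
reviews_per_category = (
    app_reviews.groupby("Category")["Translated_Review"].sum()
    .sort_values(ascending=False)
)
print(apps_per_category)
print(reviews_per_category)
```

The two rankings can differ: a category with few apps can still collect many reviews.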

##### 1. Why did you pick the specific chart?

A count plot gives a clear representation of the counts in each category.

##### 2. What is/are the insight(s) found from the chart?

We found that Game has the highest count, followed by Family, while the Comics, Weather and Events categories have the fewest reviewed apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The Family and Game categories have the largest number of apps with translated reviews.

### Chart - 13
Sentiment Visualization for GAME, FAMILY, SOCIAL, LIBRARIES AND DEMO

In [None]:
df.Category.unique()

In [None]:
# Chart - 13 visualization code
x=['GAME','FAMILY','SOCIAL', 'LIBRARIES AND DEMO']
#we can choose any category and include in the x variable

fig, axes = plt.subplots(1, len(x), figsize=(20, 5))

for i, category in enumerate(x):
    temp_df = merge_df[merge_df['Category'] == category]
    sns.countplot(data=temp_df, x='Sentiment', ax=axes[i])
    axes[i].set_title(category)

fig.suptitle('Sentiment Analysis by Category')

# Show the plot
plt.show()

In [None]:
# # Chart - 13 visualization code (old code)
# x=['GAME','FAMILY','COMMUNICATION', 'LIBRARIES AND DEMO']
# x = df.Category.unique()   #if we uncomment this all the category plots will be plotted in loop.
# for i in x:

#     temp_df = merge_df[merge_df['Category']==i]

# # Compare in positive and negetive review
#     plt.figure(figsize=(5, 5))
#     sns.countplot(data = temp_df, x = temp_df['Sentiment'])
#     plt.title(f'{i} app Sentiment')
#     plt.ylabel('Number of Translated_Review')
#     # plt.show()

##### 1. Why did you pick the specific chart?

Count plots are well suited to showing how many rows fall into each class, which makes the sentiment counts easy to interpret visually.

##### 2. What is/are the insight(s) found from the chart?

For GAME, FAMILY, SOCIAL and LIBRARIES AND DEMO, most of the reviews carry a positive sentiment.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The charts above show that most of the reviews in the GAME, FAMILY, SOCIAL and LIBRARIES AND DEMO categories have a positive sentiment.

We can also infer that approximately 50% of the sentiments in the SOCIAL category are negative.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
# Chart - 14 visualization code
corr_matrix = df.corr(numeric_only=True)   # restrict to numeric columns; text columns cannot be correlated

# Plot heatmap
sns.heatmap(corr_matrix, annot=True, cmap='inferno')

#setting labels to infer the plot
plt.title('Correlation Matrix heatmap')
plt.ylabel('Feature/Property')
plt.xlabel('Feature/Property')

##### 1. Why did you pick the specific chart?

A heatmap drawn with the seaborn library is a clear way to visualise a correlation matrix.
> The values in the matrix range from -1 to 1, and represent the strength and direction of the correlation between two variables. A correlation coefficient of -1 indicates a perfect negative correlation, a coefficient of 0 indicates no correlation, and a coefficient of 1 indicates a perfect positive correlation.

##### 2. What is/are the insight(s) found from the chart?

We see that the correlation between Installs and Reviews is 0.63, a moderately strong positive relationship: apps with more reviews tend to have more installs. Correlation does not tell us which one drives the other, but encouraging reviews on the app is still a sensible idea.

These insights are helpful for growing a business, as we can add a feature that prompts users to review the app, which may help the app to grow.
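
The correlation coefficient shown in each cell of the heatmap can be computed by hand. The sketch below implements the Pearson formula in plain Python on made-up numbers; the `pearson` helper is written for illustration and is not part of the notebook.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

reviews = [10, 20, 30, 40]
installs_up = [100, 200, 300, 400]     # rises with reviews
installs_down = [400, 300, 200, 100]   # falls as reviews rise

print(pearson(reviews, installs_up), pearson(reviews, installs_down))
```

A perfectly rising relationship gives a coefficient of 1 and a perfectly falling one gives -1; real data such as Installs and Reviews lands somewhere in between.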

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
pair_grid = sns.pairplot(df, diag_kind="kde", kind = 'reg', hue = 'Type')

# pairplot builds its own grid of axes, so the title is set on the whole figure
pair_grid.figure.suptitle('Pair Plot', y=1.02)


##### 1. Why did you pick the specific chart?

Pair plots show the pairwise relationships between the numeric variables.

They also help us explore the distribution of each variable in the dataset.

##### 2. What is/are the insight(s) found from the chart?

There is a relationship between Reviews and Installs: the more installs an app has, the more reviews it gets. The regression line is linear and rising.

Installs do not grow as the price increases.

Size and Installs do not show a relationship.

The diagonal plots show the KDE of each feature on its own. Note that the diagonal plots do not share the scale of the other plots.

### Chart - 16
Segmentation of apps by size as extra large, large, medium and lite size

In [None]:
# Chart - 16 visualization code
min_size = df.Size.min()
max_size = df.Size.max()

def group_by_size(val):
  '''
  This function helps to categorise the size (from 101304 down to 0)
  as Extra Large App, Large App, Medium Size App or Lite Size App,
  using four equal-width bands of max_size/4 each
  '''
  # each branch only runs if the previous ones failed, so no upper bound is needed
  if val >= (max_size/4)*3:
    return 'Extra Large App'
  elif val >= (max_size/4)*2:
    return 'Large App'
  elif val >= (max_size/4):
    return 'Medium Size App'
  else:
    return 'Lite Size App'

# print(min_size)
# print(max_size)


In [None]:
df['Size Groups'] = df['Size'].apply(group_by_size)

In [None]:
plt.rcParams['figure.figsize'] = (20, 5)
size_group_df = df.groupby(['Size Groups'])['Installs'].sum()
size_group_df.plot.pie(autopct='%1.2f%%')

#setting labels to infer the plot
plt.title('Share of total installs by size category')

##### 1. Why did you pick the specific chart?

A pie chart makes it easy to see the percentage share covered by each size category.

##### 2. What is/are the insight(s) found from the chart?

We infer that most people prefer lite apps, which account for approximately 75% of installs, while the other size groups hold much smaller shares (this may depend on the use case and the extra functionality of larger apps).
Extra large apps have a 6% share, while medium and large apps each hold about 9.5% of the total market (as of the 2018 data).
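
The same four equal-width bands can also be expressed with `pandas.cut`, which avoids a hand-written chain of comparisons. The sketch below is a minimal alternative, not the notebook's method; the sample sizes are hypothetical and only the column names `Size` and `Size Groups` come from the notebook.

```python
import pandas as pd

# Hypothetical sizes; only the column name 'Size' comes from the notebook.
sizes = pd.DataFrame({"Size": [500, 12_000, 30_000, 55_000, 80_000, 100_000]})

max_size = sizes["Size"].max()
labels = ["Lite Size App", "Medium Size App", "Large App", "Extra Large App"]

# Four bands of width max_size/4. right=False makes each band include its
# lower edge, matching the >= comparisons in group_by_size; the last edge is
# infinity so that max_size itself falls in the top band.
edges = [0, max_size / 4, max_size / 2, 3 * max_size / 4, float("inf")]
sizes["Size Groups"] = pd.cut(sizes["Size"], bins=edges, labels=labels, right=False)
print(sizes)
```

One behavioural difference is worth knowing: `pd.cut` leaves missing sizes as `NaN`, whereas a chain of `>=` comparisons sends `NaN` to the final `else` branch, so missing sizes would be counted as lite apps.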

### Chart - 17
Segmentation of apps by size with respect to categories as extra large, large, medium and lite size

In [None]:
# Chart - 17 visualization code
plt.rcParams['figure.figsize'] = (20, 5)
size_groupby_Categories_df = df.groupby(['Category', 'Size Groups'])['Installs'].sum().unstack()

size_groupby_Categories_df.plot.bar()
plt.title('segmentation of app install counts with respect to categories')
plt.xlabel('Apps By Categories')
plt.ylabel('Total Installs in Billions')

##### 1. Why did you pick the specific chart?

A bar plot is well suited to comparing numeric values across categories.

##### 2. What is/are the insight(s) found from the chart?

Gaming apps appear in every size category, while communication apps are mostly built as lite apps.

Looking at the overall trend, most apps are designed to be lite, which appears to be the better option.

### Chart - 18
Apps with the last updates

In [None]:
def get_year(x):
  ''' Extracts the four-digit year from the end of the date string and returns it as an int'''
  # int() is safer than eval() for converting text to a number
  return int(x[-4:])

In [None]:
df['year'] = df['Last Updated'].apply(get_year)

In [None]:
#distribution of the latest apps that are updated
sorted_by_year = df.sort_values(by = 'year', ascending = False)
sns.histplot(sorted_by_year, x='year', kde=True)

##### 1. Why did you pick the specific chart?

A histogram shows the distribution of values by their count.

##### 2. What is/are the insight(s) found from the chart?

Most of the apps were updated recently, while some have not been updated for a long time.
For a more detailed analysis, we can selectively inspect the apps with the oldest update years and check whether they were discontinued because of losses or shut down.
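
An alternative to slicing the last four characters is to parse the whole date with `pandas.to_datetime`, which also gives access to the month and day. The sketch below assumes the `Last Updated` strings look like `January 7, 2018`; that format and the sample rows are assumptions, not confirmed by the notebook.

```python
import pandas as pd

# Hypothetical sample in the 'Month D, YYYY' style assumed for 'Last Updated'.
updates = pd.DataFrame(
    {"Last Updated": ["January 7, 2018", "August 1, 2016", "not a date"]}
)

# errors="coerce" turns unparseable strings into NaT instead of raising.
parsed = pd.to_datetime(updates["Last Updated"], format="%B %d, %Y", errors="coerce")

# .dt.year yields the year; rows that failed to parse become NaN.
updates["year"] = parsed.dt.year
print(updates)
```

Rows that fail to parse surface as missing values, which makes malformed dates easy to spot before plotting.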

### Chart - 19
Listing top apps with max reviews, size, price, and free apps with highest ratings

In [None]:
df.head()

In [None]:
# # Top five app with max Reviews
Max_Reviews = df.sort_values(by='Reviews', ascending=False)

# display the names and Reviews in the sorted dataframe
print('______________________________________________')
print('\ntop 5 apps with max reviews\n')
print('______________________________________________')

print(Max_Reviews[['App', 'Reviews']].head())

# # Top five app with max size
Max_Size = df.sort_values(by='Size', ascending=False)

# display the names and Size in the sorted dataframe
print('______________________________________________')
print('\ntop 5 apps with max size\n')
print('______________________________________________')

print(Max_Size[['App', 'Size']].head())

# # Top five apps with max installs
Max_Installs = df.sort_values(by='Installs', ascending=False)

# display the names and Installs in the sorted dataframe
print('______________________________________________')
print('\ntop 5 apps with max Installs\n')
print('______________________________________________')

print(Max_Installs[['App', 'Installs']].head())

# # Top ten apps with max price
Max_Price = df.sort_values(by='Price', ascending=False)

# display the names and Price in the sorted dataframe
print('______________________________________________')
print('\ntop 10 apps with max price\n')
print('______________________________________________')
print(Max_Price[['App', 'Price']].head(10))


# highest rating for top five free apps and their category
df_Free_apps = df[df['Type'] == 'Free']
Top_5 = df_Free_apps.sort_values(by='Rating', ascending=False)

# display the names, Rating, Category and Type in the sorted dataframe
print('______________________________________________')
print('\ntop 5 free apps with highest ratings\n')
print('______________________________________________')
print(Top_5[['App', 'Rating', 'Category', 'Type']].head())



##### 1. Why did you pick the specific chart?

We are not using a chart here. We simply list the top apps for a few predefined parameters to get an overview.
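
The repeated sort-then-head pattern can be condensed with `DataFrame.nlargest`. The sketch below is one possible shape for it; the helper `top_n` and the sample rows are hypothetical, and only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the notebook.
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Reviews": [120, 98_000, 4_500, 98_000],
    "Type": ["Free", "Free", "Paid", "Free"],
    "Rating": [4.9, 4.4, 4.7, 4.9],
})

def top_n(frame, column, n=5):
    """Return the n rows with the largest values in `column` (hypothetical helper)."""
    return frame.nlargest(n, column)[["App", column]]

top_reviews = top_n(apps, "Reviews", n=2)
top_free = top_n(apps[apps["Type"] == "Free"], "Rating", n=2)
print(top_reviews)
print(top_free)
```

When values tie, `nlargest` keeps the rows that appear first, so a ranking by rating alone may be decided by row order; adding a second column such as `Reviews` as a tie-breaker may give a more meaningful list.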

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

The Google Play Store Apps report provides useful details about which apps are trending in the Play Store. As the visualizations above show, most of the trending apps (in terms of user installs) come from categories such as GAME, COMMUNICATION, and TOOL, even though these categories contain less than half as many apps as the FAMILY category. These apps are most likely trending because they entertain or assist the user. It also suggests that developers in these categories focus on the quality of their apps rather than the quantity.

Beyond that, the charts above imply that most apps with ratings above 4.0 also have a high number of reviews and installs. Size and price should not be read as drivers of high ratings: the highly rated apps that are large or expensive appear to be a small minority. Furthermore, most of the apps with a high number of reviews are from the SOCIAL, COMMUNICATION and GAME categories, such as Facebook, WhatsApp Messenger, Instagram, Messenger – Text and Video Chat for Free, Clash of Clans, Google apps, etc.

Even though apps from categories such as GAME, SOCIAL, COMMUNICATION and TOOL have the highest installs, ratings and reviews, and so reflect the current trend among Android users, none of these categories appears among the top 5 most expensive apps in the store. In conclusion, the current trend in the Android market is led by these categories, whose apps assist, connect or entertain users.

Some important points:
The average rating of (active) apps on the Google Play Store is 4.17.

Looking at individual apps, communication apps like Facebook and WhatsApp are the most highly reviewed, which shows that people are regularly active on them and also give their feedback.

Medical and Family apps are the most expensive, with prices extending up to $80.
Users tend to download a given app more if it has been reviewed by a large number of people.

More than half users rate Family, Sports and Health & Fitness apps positively. Apps for games and social media get mixed reviews, with 50 percent positive and 50 percent negative responses.

# **Conclusion**

In this project of analyzing play store applications, we have worked on several parameters which would help our client to do well in launching their apps on the play store.

1. 92.2% of apps are free and 7.8% are paid.
2. The minimum rating received in Education is 3.5, followed by Art and Design at 3.4, while Dating, Finance, Game, Tools, Communication, Business, Productivity, Medical and Family have the lowest minimum rating of 1. The average ratings in these categories are nevertheless fairly good, above 4.1.
3. Comparing the ratio of paid to free apps across categories, Personalization and Medical have a higher percentage of paid apps than other categories.
4. Family has more apps than any other category: approximately 1800 apps belong to Family, whereas Game has around 950.
5. Games have the most installs, followed by Communication and Tools.
6. 81.81% of apps are rated for Everyone and 10.74% for Teen, followed by Everyone 10+ and Mature 17+ with 4.07% and 3.33% respectively.
7. The most common price for paid apps is 0.99 dollars, followed by 2.99 dollars and 1.99 dollars, which suggests that keeping the app affordable is important.
8. The Education, Beauty, Entertainment, Parenting, Weather and Maps and Navigation categories have fairly uniformly distributed ratings, since few points fall outside the whiskers of their box plots.
9. The ratings span 1 to 5 and are roughly bell-shaped, with a mean of 4.2 and a median of 4.3; one outlier was handled in the data wrangling step.
10. With the user reviews DataFrame, we found that 64.11% of the translated reviews received are positive, 22.10% are negative and 13.79% are neutral.
11. The Game category has the most translated reviews, followed by Family (plotted in Chart 12 above).
12. Checking sentiment across app categories, we found that social apps have approximately 50% negative sentiment.
13. The correlation between Installs and Reviews is 0.63, a moderately strong positive relationship: apps with more reviews tend to have more installs, so encouraging reviews is a sensible step.
14. There is a relationship between reviews and installs: the apps with the most installs also have the most reviews. The regression line is linear and increasing (approximately x = y).
15. Installs do not grow as the price increases.
16. Size and installs show no clear relationship.
17. The diagonal plots in the pair plot show the KDE of each feature on its own; note that the diagonal plots do not follow the y-axis scale of the other plots.
18. We grouped apps by size into 4 categories (Lite, Medium, Large and Extra Large) and found that most people prefer lite apps, which account for approximately 75% of installs, while the other groups hold much smaller shares (this may depend on the use case and the extra functionality of larger apps). Extra large apps have a 6% share, while medium and large apps each hold about 9.5% of the total market (as of the 2018 data).
19. Gaming apps appear in every size category, while communication apps are mostly built as lite apps.
20. Looking at the overall trend, most apps are designed to be lite, which appears to be the better option.
21. We extracted the year of the last update and found that most apps were updated recently, while some have not been updated for a long time. For a more detailed analysis, we can selectively inspect the apps with the oldest update years and check whether they were discontinued because of losses or shut down.
22. We generated lists of the top apps by Reviews, Size, Installs and Price, and of the top-rated free apps (listed in the Chart 19 section).
23. Some apps are duplicated under the same name with different spellings (for example, the I am Rich app). It is recommended to filter the data or obtain more accurate data; cleaning this in code is not easy because of the spelling differences in the app names.
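
One possible starting point for spotting such near-duplicate names is fuzzy string matching with the standard library's `difflib`. The sketch below is an illustration only: the helper `near_duplicates`, the 0.85 threshold and the sample names are all assumptions, and any real threshold would need tuning against the actual app names.

```python
from difflib import SequenceMatcher
from itertools import combinations

def near_duplicates(names, threshold=0.85):
    """Return (name_a, name_b, score) for pairs at or above the similarity threshold."""
    # Normalise case and whitespace so trivial differences do not lower the score.
    cleaned = [(name, " ".join(name.lower().split())) for name in names]
    pairs = []
    for (a, a_norm), (b, b_norm) in combinations(cleaned, 2):
        score = SequenceMatcher(None, a_norm, b_norm).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return pairs

# Hypothetical app names for illustration.
names = ["I am Rich", "I Am Rich!", "I'm Rich", "Photo Editor"]
pairs = near_duplicates(names)
print(pairs)
```

Comparing every pair grows quadratically with the number of names, so for the full dataset it may help to compare only within a category or among names sharing a first word. Flagged pairs are candidates for manual review rather than automatic deletion.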

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***
