<a href="https://colab.research.google.com/github/abhijeet8825/CAPSTONE-Play-Store-App-Review-Analysis-Abhijeet-Kumar/blob/main/Playstore_App_Review_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - **Playstore App Review Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**

Mobile apps are everywhere. They are easy to create and can be lucrative, so more and more of them are being developed. This notebook analyzes the Android app market by comparing over ten thousand apps on Google Play across different categories, looking for insights in the data that can inform strategies to drive growth and retention.

Let's take a look at the data, which consists of two files:

* playstore data.csv: contains all the details of the applications on Google Play. There are 13 features that describe a given app.
* user_reviews.csv: contains 100 reviews for each app, most helpful first. The text in each review has been pre-processed and attributed with three new features: Sentiment (Positive, Negative or Neutral), Sentiment Polarity and Sentiment Subjectivity.

# **GitHub Link -**

https://github.com/abhijeet8825

# **Problem Statement**


* What are the top categories on the Play Store?
* Are the majority of apps paid or free?
* How important is an app's rating?
* Which audience categories should an app target?
* Which category has the most installs?
* How does the number of apps vary by genre?
* How does the last update date affect the rating?
* How are ratings affected when an app is paid?
* How are reviews and ratings correlated?
* What can be said about sentiment subjectivity?
* Are subjectivity and polarity proportional to each other?
* What is the percentage of each review sentiment?
* How does sentiment polarity vary between paid and free apps?
* How does Content Rating affect an app?
* How are app updates distributed over the years?
* How are paid and free app updates distributed over the months?




#### **Define Your Business Objective?**

* The rating of every app should be above 4.5.
* Content should be improved, which automatically boosts the rating.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that each chart answers the questions in the following format.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the visualization in a structured way while following the "UBM" rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.filterwarnings("ignore")


### Dataset Loading

In [None]:

# Load Dataset
from google.colab import drive
drive.mount('/content/drive')
path='/content/drive/MyDrive/Data/Play_Store_Data.csv'
df=pd.read_csv(path)

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
df.shape

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
plt.figure(figsize = (8,9))
sns.heatmap(df.isnull(), cbar=False)

### What did you know about your dataset?

* The dataset has 10841 rows and 13 columns.
* The Rating column has 1474 missing values.
* There are 483 duplicated rows in this dataset.
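
The three checks above can be bundled into one helper. The following is a minimal sketch, not part of the original notebook: the helper name `quick_overview` and the toy frame are assumptions used only for illustration.

```python
import pandas as pd

def quick_overview(frame: pd.DataFrame) -> dict:
    """Summarise shape, duplicate rows and missing values of a DataFrame."""
    return {
        "rows": frame.shape[0],
        "columns": frame.shape[1],
        # duplicated() flags every repeat of an earlier, fully identical row
        "duplicate_rows": int(frame.duplicated().sum()),
        "missing_per_column": frame.isnull().sum().to_dict(),
    }

# Toy data standing in for the Play Store CSV (hypothetical values)
toy = pd.DataFrame({"App": ["A", "B", "B", "C"], "Rating": [4.1, None, None, 3.5]})
overview = quick_overview(toy)
print(overview)
```

On the real data this would be called as `quick_overview(df)` right after loading.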

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe()

### Variables Description

* App - It tells us about the name of the application with a short description (optional).
* Category - It gives the category to the app.
* Rating - It contains the average rating the respective app received from its users.
* Reviews - It tells us about the total number of users who have given a review for the application.
* Size - It tells us about the size being occupied the application on the mobile phone.
* Installs - It tells us about the total number of installs/downloads for an application.
* Type - It states whether an app is free to use or paid.
* Price - It gives the price payable to install the app. For free type apps, the price is zero.
* Content Rating - It states whether or not an app is suitable for all age groups or not.
* Genres - It tells us about the various other categories to which an application can belong.
* Last Updated - It tells us about the when the application was updated.
* Current Ver - It tells us about the current version of the application.
* Android Ver - It tells us about the android version which can support the application on it's platform.

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
# Handling null values in current version
df[df["Current Ver"].isnull()]

Only 8 values are missing in Current Ver. Because the number is so small, dropping these rows is a good option and will not noticeably affect the data.

In [None]:
# Dropping the rows with NaN values in the 'Current Ver' column.
df=df[df['Current Ver'].notna()]

# Shape of updated DataFrame
df.shape


In [None]:
# Handling null values in the Android Ver column
df[df["Android Ver"].isnull()]

There are 3 missing values in Android Ver, so these rows can be dropped as well.

In [None]:
# Dropping the rows with NaN values in the 'Android Ver' column.
df=df[df['Android Ver'].notna()]
# Shape of the updated dataframe
df.shape

In [None]:
# Handling null value in Type column
df[df["Type"].isnull()]

There is only one missing value in the Type column, so this row can be dropped too.

In [None]:
# Dropping the rows with NaN values in the 'Type' column.
df = df[df["Type"].notna()]
# Shape of the updated dataframe
df.shape

In [None]:
# Handling missing values in the Rating column
df[df["Rating"].isnull()]

In [None]:
df["Rating"].isnull().sum()

There are 1469 missing values in the Rating column. Dropping all of these rows would discard a lot of useful data, so it is not a good option for this column. Instead, we will impute the missing values with the mean or the median.

In [None]:
# Checking for outliers in the Rating column
plt.figure(figsize=(8,6))
sns.histplot(x = df.Rating, kde= True)
plt.show()

In [None]:
# Checking outliers with a boxplot
plt.figure(figsize=(8,6))
sns.boxplot(x = df.Rating)

The Rating column is left-skewed and has a lot of outliers, so we will impute with the median, which is less sensitive to extreme values than the mean.
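
To illustrate why the median is the safer choice for a skewed column, here is a minimal sketch on made-up ratings (the values are hypothetical, not taken from the dataset). A single very low rating pulls the mean down, while the median stays close to the bulk of the data.

```python
import pandas as pd

# Hypothetical ratings: mostly high, one low outlier, two missing values
ratings = pd.Series([4.5, 4.4, 4.3, 4.6, 4.2, 1.0, None, None])

mean_value = ratings.mean()      # dragged down by the outlier
median_value = ratings.median()  # stays near the typical rating

# Fill the gaps both ways to compare the imputed values
mean_fill = ratings.fillna(mean_value)
median_fill = ratings.fillna(median_value)

print(f"mean = {mean_value:.2f}, median = {median_value:.2f}")
```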

In [None]:
# Replacing the NaN values in the 'Rating' column with its median value
# (assigning the result back avoids the deprecated chained inplace fill)
df['Rating'] = df['Rating'].fillna(df["Rating"].median())
df[df["Rating"].isnull()]

In [None]:
# Handling duplicates in the App column
df["App"].value_counts()

In [None]:
df[df["App"].duplicated()]

In [None]:
# Keeping only the first occurrence of each app
df = df.drop_duplicates(subset = "App")

In [None]:
df["App"].duplicated().sum()

In [None]:
# Changing datatype of Last Updated column from string to date time
df["Last Updated"] = pd.to_datetime(df['Last Updated'])
df["Last Updated"].dtype

In [None]:
# Changing data type of price column from string to float

df["Price"].value_counts()

In [None]:
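# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
df["Price"] = df["Price"].str.replace("$", "", regex=False)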
df.loc[234:235]

In [None]:
df["Price"] = df["Price"].astype(float)
df["Price"].dtype

In [None]:
# Checking the contents of the 'Installs' column

df["Installs"].value_counts()

In [None]:
# Removing the '+' sign and the thousands separators from Installs
# (regex=False treats '+' as a literal character rather than a regex quantifier)
df["Installs"] = df["Installs"].str.replace("+", "", regex=False)
df["Installs"] = df["Installs"].str.replace(",", "", regex=False)
df["Installs"].value_counts()

In [None]:
# Changing data type of the Installs column from string to int
df["Installs"] = df["Installs"].astype(int)
df["Installs"].dtype

In [None]:
# Converting the values in the Size column to the same unit of measure (MB).
df["Size"].value_counts()

The values in the Size column use different units: 'M' stands for MB and 'k' stands for KB. To analyse this column easily, all values need to be in a single unit, so we will convert everything to MB.

Since 1 MB = 1024 KB, a value in KB is converted to MB by dividing it by 1024.

In [None]:
# Defining a function to convert all the entries in KB to MB and then converting them to float datatype.

def convert_kb_to_mb(val):
  '''
  Converts a size string such as '19M' or '512k' to a float in MB.
  Any other value (for example 'Varies with device') is returned unchanged.
  '''
  try:
    if 'M' in val:
      return float(val[:-1])
    elif 'k' in val:
      return round(float(val[:-1])/1024, 4)
    else:
      return val
  except (TypeError, ValueError):
    # Non-string or malformed entries are left as they are
    return val

In [None]:
# The convert_kb_to_mb function applied to the Size column
df['Size'] = df['Size'].apply(convert_kb_to_mb)
df.head()

In [None]:
df[df["Size"]== "Varies with device"]

The output above shows that many rows in the Size column contain 'Varies with device'. These rows need to be turned into numerical values for further analysis.
I will replace them with NaN; the NaN values can then be imputed with the median or the mean, depending on the outliers and skewness.
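
As an alternative to the row-by-row function, the unit conversion and the 'Varies with device' handling can be sketched in one vectorized step. This is only an illustration under stated assumptions: the helper name `size_to_mb` is hypothetical, and it assumes every valid entry ends in 'M' or 'k'. Anything else becomes NaN.

```python
import pandas as pd

def size_to_mb(sizes: pd.Series) -> pd.Series:
    """Convert strings like '19M' or '512k' to MB; other entries become NaN."""
    text = sizes.astype(str).str.strip()
    # Numeric part: everything except the last character ('19M' -> 19.0)
    number = pd.to_numeric(text.str[:-1], errors="coerce")
    # Unit: last character mapped to a multiplier; unknown units map to NaN
    factor = text.str[-1].map({"M": 1.0, "k": 1 / 1024})
    return (number * factor).round(4)

# Toy values in the same format as the Size column
toy_sizes = pd.Series(["19M", "512k", "Varies with device"])
converted = size_to_mb(toy_sizes)
print(converted)
```

A median imputation could then follow with `converted.fillna(converted.median())`, if imputing is preferred over leaving the gaps.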

In [None]:
# Replacing 'Varies with device' with NaN and converting the column to float
df['Size'] = pd.to_numeric(df['Size'], errors='coerce')

In [None]:
# Counting the rows whose size is now NaN
df['Size'].isna().sum()

In [None]:
# Checking for outlier
plt.figure(figsize=(8,6))
sns.histplot(df.Size, kde = True)


The Size column is right-skewed: a long tail of large apps shows up as outliers.
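
The visual impression can be backed by numbers. Below is a minimal sketch that measures skewness and counts outliers with the common 1.5 × IQR rule, which is also the rule a boxplot uses for its whiskers by default. The sizes are made-up values, not the real column.

```python
import pandas as pd

# Hypothetical app sizes in MB: mostly small, one very large app
sizes = pd.Series([1, 2, 2, 3, 3, 4, 5, 100], dtype=float)

skewness = sizes.skew()  # a positive value indicates a right-skewed distribution

# Interquartile range and the usual 1.5 * IQR fences
q1 = sizes.quantile(0.25)
q3 = sizes.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = sizes[(sizes < lower_fence) | (sizes > upper_fence)]
print(f"skewness = {skewness:.2f}, outliers = {outliers.tolist()}")
```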

In [None]:
plt.figure(figsize=(8,6))
sns.boxplot(x = df.Size)
plt.show()

In [None]:
df.head()

In [None]:
# Converted data type of reviews from string to int
df["Reviews"] = df["Reviews"].astype(int)
df["Reviews"].dtype

### Interacting with User_Review Dataset

In [None]:
# Load the user reviews dataset (Drive was already mounted when loading the Play Store data)
path='/content/drive/MyDrive/Data/User_Reviews.csv'
data=pd.read_csv(path)

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.info()

In [None]:
data.isnull().sum()

In [None]:
data.dropna(inplace = True)
data.shape

### What all manipulations have you done and insights you found?

### Playstore App dataset manipulations and insights

* First, I checked the dimensions of the dataset and found 10841 rows and 13 columns.
* I checked for missing values and found 1474 in Rating, 1 in Type, 1 in Genres, 8 in Current Ver and 3 in Android Ver.
* I dropped the rows with missing values in these columns, except for Rating, which had too many missing values to drop. For Rating, I imputed the median after checking outliers and skewness.
* I checked the App column for duplicates, found 1181 duplicated rows and dropped all of them.
* I converted the Price column from string to float after removing the '$' sign.
* I converted the Installs column from string to integer after removing the '+' sign and the commas.
* I converted the Size column from string to float. Before converting, I brought all values to the same unit (MB), because the column contained values in both KB and MB.
* I visualized the Size column and found that it contains many outliers and is right-skewed.
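
The Price and Installs clean-up can be sketched on a few sample strings. The values below are made up but follow the format of the two columns. Passing `regex=False` matters here because both `$` and `+` have a special meaning in regular expressions.

```python
import pandas as pd

# Sample strings in the same format as the raw columns (hypothetical values)
price = pd.Series(["0", "$4.99", "$399.99"])
installs = pd.Series(["10,000+", "1,000,000+", "0"])

# Strip the currency sign literally, then convert to float
price_clean = price.str.replace("$", "", regex=False).astype(float)

# Strip the '+' suffix and the thousands separators, then convert to int
installs_clean = (
    installs.str.replace("+", "", regex=False)
            .str.replace(",", "", regex=False)
            .astype(int)
)

print(price_clean.tolist(), installs_clean.tolist())
```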



### User_Review insights and manipulations

* I checked the dimensions of the dataset and found 64295 rows and 5 columns.
* I checked for null values and found many of them. I dropped these rows because they carried no related data that could be used to fill the gaps, so dropping was the better option.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1  What is the ratio of paid apps to free apps?

In [None]:
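# Chart - 1 visualization code
# Count of apps per Type. A separate name is used so that the user reviews DataFrame `data` is not overwritten.
type_counts = df['Type'].value_counts()
# Labels are taken from the counts so that they always match the slices
labels = type_counts.index

# create pie chart
plt.figure(figsize=(7,7))
colors = ["#00EE76","#7B8895"]
explode=(0.01,0.1)
plt.pie(type_counts, labels = labels, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Distribution of Paid and Free apps',size=15,loc='center')
plt.legend()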

##### 1. Why did you pick the specific chart?

I used this pie chart to see the proportions of paid and free apps.

##### 2. What is/are the insight(s) found from the chart?

* 92.20% of the apps on the Play Store are free.
* 7.80% of the apps are paid.
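
The same percentages can be read directly from the data instead of from the chart labels. Here is a minimal sketch using `value_counts(normalize=True)` on a toy `Type` column. The toy split is 90/10 and is not the real ratio.

```python
import pandas as pd

# Toy Type column: 9 free apps and 1 paid app (hypothetical)
types = pd.Series(["Free"] * 9 + ["Paid"], name="Type")

# normalize=True returns fractions instead of counts; scale to percent
share = (types.value_counts(normalize=True) * 100).round(2)
print(share)
```

On the notebook's data the equivalent call would be `df['Type'].value_counts(normalize=True) * 100`.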

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The vast majority of apps are available for free, so a new app should follow the same trend.

#### Chart - 2 Top categories on playstore

In [None]:
# Chart - 2 visualization code
# Number of apps per category, sorted from most to fewest
category_counts = df['Category'].value_counts()
x_list = category_counts.tolist()        # app counts
y_list = category_counts.index.tolist()  # category names

In [None]:
#Number of apps belonging to each category in the playstore
plt.figure(figsize=(20,10))
plt.xlabel('App category', size=15)
plt.ylabel('Number of Apps', size=15)
graph = sns.barplot(y = x_list, x = y_list, palette= "tab10")
graph.set_title("Top categories on Playstore", fontsize = 25)
graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right',);

##### 1. Why did you pick the specific chart?

I used this bar graph to see which app category contains the most apps.

##### 2. What is/are the insight(s) found from the chart?

Top 3 categories on the Play Store:
* Family
* Game
* Tools

The remaining categories have progressively fewer apps as we move from left to right.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Looking at the graph carefully, one insight is that it may pay off to develop apps in the categories that currently have fewer apps available.

#### Chart - 3 Number of app installs in each category

In [None]:
# Chart - 3 visualization code
# total app installs in each category of the play store

a =df.groupby(['Category'])['Installs'].sum().sort_values()
a.plot.barh(figsize=(15,10), color = 'c', )
plt.ylabel('App category', fontsize = 13)
plt.xlabel('Total app Installs', fontsize = 13)
plt.xticks()
plt.title('Total app installs in each category', fontsize = 20)

##### 1. Why did you pick the specific chart?

I used this chart to see which app category has the highest number of installs.

##### 2. What is/are the insight(s) found from the chart?

The most installed categories are:

* Game
* Communication
* Tools
* Productivity
* Social

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

I do not see any category with negative growth, but some categories such as Events, Beauty, Medical and Parenting have few installs. The company would have to work on these categories to increase their number of installs.

#### Chart - 4  Average ratings of apps

In [None]:
# Chart - 4 visualization code
# Average app ratings

df['Rating'].value_counts().plot.bar(figsize=(20,8), color = 'm' )
plt.xlabel('Average rating',fontsize = 15 )
plt.ylabel('Number of apps', fontsize = 15)
plt.title('Average rating of apps in Playstore', fontsize = 20)
plt.legend()

##### 1. Why did you pick the specific chart?

I used this bar chart to see how many apps have each average rating.

##### 2. What is/are the insight(s) found from the chart?

The five most common ratings:
* More than 2000 apps have a rating of 4.3.
* Almost 900 apps have a rating of 4.4.
* Almost 800 apps have a rating of 4.5.
* Almost 700 apps have a rating of 4.2.
* Almost 600 apps have a rating of 4.5.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The graph also shows how many apps have the lowest ratings. We can work on those apps to improve their ratings.

#### Chart - 5 What are the Top 10 installed apps?

In [None]:
# Chart - 5 visualization code
def findtop10incategory(category):
    # Category names are stored in upper case in the dataset
    category = category.upper()
    top10 = df[df['Category'] == category]
    top10apps = top10.sort_values(by='Installs', ascending=False).head(10)
    plt.figure(figsize=(15,6), dpi=100)
    plt.title('Top 10 Installed Apps',size = 20)
    graph = sns.barplot(x = top10apps.App, y = top10apps.Installs, palette= "icefire")
    graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right')

findtop10incategory('GAME')

##### 1. Why did you pick the specific chart?

I used this chart to see which apps in the Game category have the highest number of installs.

##### 2. What is/are the insight(s) found from the chart?

* Subway Surfers has the highest number of installs.
* Candy Crush Saga, Temple Run 2, Pou and My Talking Tom have the same number of installs, which is higher than that of the remaining 5 apps.


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These apps all belong to the Game category and have a large number of installs, so working on more game applications looks promising.

#### Chart - 6  Percentage share of apps in each category

In [None]:
# Chart - 6 visualization code
# Percentage of apps belonging to each category in the playstore
plt.figure(figsize=(30,30))
plt.pie(df.Category.value_counts(), labels=df.Category.value_counts().index, autopct='%1.2f%%')
my_circle = plt.Circle( (0,0), 0.50, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.title('% of apps share in each Category', fontsize = 25)
plt.show()

##### 1. Why did you pick the specific chart?

I used a donut chart to see the percentage share of apps in each category.

##### 2. What is/are the insight(s) found from the chart?

* The Family category has 18.95% of the apps.
* The Game category has 9.94% of the apps.
* The Tools category has 8.55% of the apps.

The share decreases as we move anticlockwise around the chart.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Many categories have only a small share of the apps. We should work on increasing the share in those categories.

#### Chart - 7  Average app ratings

In [None]:
# Defining a function Rating_app to group the ratings
def Rating_app(val):
  '''
  This function categorises a rating from 1 to 5
  as Top rated, Above Average, Average or Below Average
  '''
  if val>=4:
    return 'Top rated'
  elif val>=3:
    # 3 is included here so that a rating of exactly 3 is not counted as Below Average
    return 'Above Average'
  elif val>=2:
    # 2 is included here so that a rating of exactly 2 is not counted as Below Average
    return 'Average'
  else:
    return 'Below Average'

In [None]:
# Applying the Rating_app function
df['Rating_group']=df['Rating'].apply(Rating_app)

In [None]:
# Chart - 7 visualization code
# Average app ratings
df['Rating_group'].value_counts().plot.bar(figsize=(15,5), color = 'royalblue')
plt.xlabel('Rating Group', fontsize = 12)
plt.ylabel('Number of apps', fontsize = 12)
plt.title('Average app ratings', fontsize = 18)
plt.xticks(rotation=0)
plt.legend()


##### 1. Why did you pick the specific chart?

This chart shows how many apps belong to each rating group.

##### 2. What is/are the insight(s) found from the chart?

* Almost 8000 apps are top rated.
  * Top rated covers ratings of 4 and above.
* Almost 2000 apps have above average ratings.
  * Above Average covers ratings between 3 and 4.
* Almost 500 apps have average or below average ratings.
  * Average covers ratings between 2 and 3.
  * Below Average covers ratings below 2.
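
The same grouping can be expressed with `pd.cut`, which makes the bin edges explicit. This is a minimal sketch on made-up ratings. It assumes half-open bins in which each lower edge is included (so exactly 2.0 is Average, 3.0 is Above Average and 4.0 is Top rated), which is one way to read the ranges listed above.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, including values that sit exactly on a bin edge
ratings = pd.Series([1.5, 2.0, 3.0, 3.9, 4.0, 5.0])

bins = [-np.inf, 2, 3, 4, np.inf]
labels = ["Below Average", "Average", "Above Average", "Top rated"]

# right=False makes each bin include its lower edge: [2, 3), [3, 4), [4, inf)
groups = pd.cut(ratings, bins=bins, labels=labels, right=False)
print(groups.value_counts())
```

Stating the edges in one place makes it easy to see which group a boundary value such as 3.0 falls into.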

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights suggest that the business is doing well, but we should also focus on the Average and Below Average rating groups to improve it further.

#### Chart - 8  Which categories do the most installed free apps belong to?

In [None]:
# Creating a df for only free apps (copied so that columns can be added to it later)

free_df = df[df['Type'] == 'Free'].copy()

In [None]:
# Creating a df of the free apps that share the highest install count

top_free_df = free_df[free_df['Installs'] == free_df['Installs'].max()]
top10free_apps=top_free_df.nlargest(10, 'Installs', keep='first')
top10free_apps.head(10)

In [None]:
# Top free apps

top_free_df['App']

In [None]:
# Chart - 8 visualization code
# Categories of the free apps that share the highest install count
top_free_df['Category'].value_counts().plot.bar(figsize=(20,5), color= ['darkcyan','blueviolet'])
plt.xlabel('Category', size=15)
plt.ylabel('Number of apps', size=15)
plt.title('Categories in which most free apps belong', size=19)
plt.xticks(rotation=45)
plt.legend()

##### 1. Why did you pick the specific chart?

I used this chart to show which categories the most installed free apps belong to.

##### 2. What is/are the insight(s) found from the chart?

Top 3 categories among the most installed free apps:
  * Communication
  * Social
  * Travel and local

The remaining categories each have the same number of these apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9 Top category in paid apps

In [None]:
# Creating a df containing only paid apps (copied so that columns can be added to it)
paid_df=df[df['Type']=='Paid'].copy()

In [None]:
# Creating a new column 'Revenue' in paid_df
paid_df['Revenue'] = paid_df['Installs']*paid_df['Price']
paid_df.head()

In [None]:
# Top 10 paid apps in the play store
top10paid_apps=paid_df.nlargest(10, 'Revenue', keep='first')
top10paid_apps['App']

In [None]:
# Categories in which the top 10 paid apps belong to
top10paid_apps['Category'].value_counts().plot.bar(figsize=(15,5), color= ["orange", "red", "green", "blue", "purple"])
plt.xlabel('Category',size=12)
plt.ylabel('Number of apps',size=12)
plt.title('Categories in which the top 10 paid apps belong', size=15)
plt.xticks(rotation=0)
plt.legend()

##### 1. Why did you pick the specific chart?

This chart shows which categories the top 10 paid apps (by revenue) belong to.

##### 2. What is/are the insight(s) found from the chart?

* Among the top 10 paid apps, the Lifestyle category has the highest number of apps.
* The Game category has fewer of these apps than Lifestyle.

The remaining 3 categories each have the same number of these apps.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The Lifestyle category has the most apps among the top paid apps, so I would suggest that the developer create free apps in the Lifestyle category.

#### Chart - 10 Top paid apps generated revenue

In [None]:
# Chart - 10 visualization code
# Top paid apps according to the revenue generated through installs alone
top10paid_apps.groupby('App')['Revenue'].mean().sort_values(ascending= True).plot.barh(figsize=(15,10), color='darkorange')
plt.xlabel('Revenue Generated (USD)', size=11)
plt.title('Top apps based on revenue generated through installation fee', size=20)
plt.legend()

##### 1. Why did you pick the specific chart?

This bar graph shows which apps have generated the most revenue.

##### 2. What is/are the insight(s) found from the chart?

Top 3 apps according to revenue generated:
* Minecraft
* I am Rich
* I am Rich Premium
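
The revenue estimate behind this ranking is simply installs multiplied by price. Below is a minimal sketch on made-up apps (names and numbers are hypothetical). Since the raw Installs values carry a '+' suffix, they are lower bounds, so the result should be read as a rough lower-bound estimate rather than actual revenue.

```python
import pandas as pd

# Hypothetical paid apps, not taken from the dataset
paid = pd.DataFrame({
    "App": ["App A", "App B", "App C"],
    "Installs": [10_000_000, 10_000, 100_000],
    "Price": [6.99, 399.99, 2.99],
})

# Estimated revenue from the installation fee alone
paid["Revenue"] = paid["Installs"] * paid["Price"]

# The two highest-earning apps, largest first
top = paid.nlargest(2, "Revenue")
print(top[["App", "Revenue"]])
```

The toy data shows why a very expensive app with few installs can still rank near the top.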

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Minecraft has generated the highest revenue of all these apps, so developing more apps similar to Minecraft looks worthwhile.

#### Chart - 11 Percentage of Reviews in Sentiments

In [None]:
# Chart - 11 visualization code

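# Labels are built from the counts so that each slice gets the right name
sentiment_counts = data['Sentiment'].value_counts()
counts = list(sentiment_counts)
labels = [f'{sentiment} Reviews' for sentiment in sentiment_counts.index]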
plt.rcParams['font.size'] = 20
plt.rcParams['figure.figsize'] = (10, 15)
plt.pie(counts, labels=labels, explode=[0.01, 0.05, 0.05], shadow=True, autopct="%.2f%%")
plt.title('Percentage of Review Sentiments', fontsize=20)
plt.axis('off')
plt.legend(bbox_to_anchor=(0.9, 0, 0.5, 1))
plt.show()

##### 1. Why did you pick the specific chart?

I chose this chart to show the proportions of positive, negative and neutral reviews.

##### 2. What is/are the insight(s) found from the chart?

* 64.12% are positive Reviews
* 22.10% are Negative Reviews
* 13.78% are Neutral Reviews

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

22.10% of the reviews are negative, so we have to work on them and turn them into positive ones by resolving the bugs that users are facing.

#### Chart - 12 Free and paid apps over time (by year of last update)

In [None]:
# Chart - 12 visualization code
# Year of each app's last update (the dataset has no release date, so this is used as a proxy)
paid_df["Apps Added"] = paid_df["Last Updated"].dt.year
free_df["Apps Added"] = free_df["Last Updated"].dt.year

In [None]:
# Paid apps are plotted first (blue), free apps second (orange)
paid_df.groupby("Apps Added")["App"].count().plot.line(marker='o', label='Paid')
free_df.groupby('Apps Added')['App'].count().plot.line(marker='o', label='Free')
plt.legend()

##### 1. Why did you pick the specific chart?

I used this line graph to see the trend over time.

##### 2. What is/are the insight(s) found from the chart?

The blue line denotes paid apps and the orange line denotes free apps.
* Before 2011 there are no paid apps in the data. They appear only after 2011.
* Free apps are present before 2011, but in small numbers. After 2013 the number of free apps rises rapidly and surpasses that of paid apps.
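
The yearly counts behind the two lines can also be computed in a single table. Here is a minimal sketch using `pd.crosstab` on a toy frame. The dates and types are made up, and the year is taken from `Last Updated`, as in the chart.

```python
import pandas as pd

# Hypothetical apps with a type and a last-update date
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Type": ["Free", "Free", "Paid", "Free", "Paid"],
    "Last Updated": pd.to_datetime(
        ["2016-03-01", "2018-07-15", "2018-07-20", "2018-01-05", "2017-11-11"]
    ),
})

# Rows: year of last update, columns: app type, values: number of apps
per_year = pd.crosstab(apps["Last Updated"].dt.year, apps["Type"])
print(per_year)
```

Calling `per_year.plot.line(marker='o')` on such a table would draw both lines with a legend in one step.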

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The number of free apps is increasing over time. Free apps are what the market demands, so we should follow that trend.

#### Chart - 13 Free and paid app updates by month

In [None]:
# Month (1-12) of each app's last update
paid_df["Update month"] = paid_df["Last Updated"].dt.month
free_df["Update month"] = free_df["Last Updated"].dt.month

In [None]:
# Chart - 13 visualization code
free_df.groupby("Update month")["App"].count().plot.bar(figsize=(10,8), color='purple')
plt.title("Free Apps update over the month", size=20)
plt.legend()

In [None]:
paid_df.groupby("Update month")["App"].count().plot.bar(figsize=(10,8), color= "green")
plt.title("Paid Apps update over the month", size=20)
plt.legend()

##### 1. Why did you pick the specific chart?

I used these bar charts to see in which months paid and free apps were last updated.

##### 2. What is/are the insight(s) found from the chart?

* Most of the paid apps were last updated in the 7th month (July).
* Most of the free apps were also last updated in the 7th month (July).

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
df.corr(numeric_only = True)

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize = (15,10))
sns.heatmap(df.corr(numeric_only = True), annot= True)
plt.title('Correlation Heatmap for Playstore Data', size=20)

##### 1. Why did you pick the specific chart?

I picked this heatmap to see the correlation between each pair of numeric columns.

##### 2. What is/are the insight(s) found from the chart?

* There is a strong positive correlation between the Reviews and Installs columns. This is expected: the higher the number of installs, the larger the user base, and the more reviews users leave.
* Price is slightly negatively correlated with Rating, Reviews and Installs. This means that as the price of an app increases, the average rating, the total number of reviews and the number of installs fall slightly.
* Rating is slightly positively correlated with the Installs and Reviews columns. This indicates that as the average user rating increases, app installs and the number of reviews also increase.
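
Reading the strongest relationships off a heatmap can be error-prone, so the pairs can also be listed in order of strength. This is a minimal sketch with a hypothetical helper `ranked_correlations` and a toy frame whose column names are placeholders.

```python
import numpy as np
import pandas as pd

def ranked_correlations(frame: pd.DataFrame) -> pd.Series:
    """Return each pair of numeric columns once, strongest correlation first."""
    corr = frame.corr(numeric_only=True)
    # Keep only the upper triangle so every pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy data: b is exactly 2 * a, c tends to fall as a rises
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})
ranking = ranked_correlations(toy)
print(ranking)
```

On the notebook's data this would be called as `ranked_correlations(df)`.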

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

# Installs and Reviews span several orders of magnitude, so both are shown on a log10 scale.
# Adding 1 first keeps apps with zero installs or zero reviews finite.
pair_df = pd.DataFrame({
    'Rating': df['Rating'],
    'Installs': np.log10(df['Installs'] + 1),
    'Reviews': np.log10(df['Reviews'] + 1),
    'Price': df['Price'],
    'Type': df['Type'],
})

p = sns.pairplot(pair_df, hue='Type')
p.fig.suptitle("Pairwise Plot - Rating, Installs, Reviews, Price",x=0.5, y=1.0, fontsize=16)

##### 1. Why did you pick the specific chart?

Pair plot is used to understand the best set of features to explain a relationship between two variables or to form the most separated clusters. It also helps to form some simple classification models by drawing some simple lines or make linear separation in our data-set.

##### 2. What is/are the insight(s) found from the chart?

* Most of the apps are free.
* Most of the paid apps have a rating of around 4.
* As the number of installs increases, the number of reviews of the particular app also increases.
* Most of the apps are small in size.

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

* Develop apps in the least populated categories, such as Events and Beauty, as they have not been explored much.
* Most apps are free, so focusing on free apps is more important.
* Focusing on content rated for Everyone will increase the chances of getting the highest number of installs.
* Update apps regularly, as this will attract more users.
* Keep in mind that user sentiment keeps changing as people use the app, so focus on users' needs and on features.

# **Conclusion**

* Percentage of free apps = ~92%
* Most competitive category: Family
* Category with the highest total app installs: Game
* The most common rating is 4.3.
* Subway Surfers has the highest number of installs in the Game category.
* The Communication category has the most apps among the most installed free apps.
* Percentage of apps that are top rated = ~80%
* Family, Game and Tools are the top three categories, with app counts of 1906, 926 and 829.
* Minecraft is the only paid app with over 10M installs. It has also produced the most revenue from the installation fee alone.
* Overall sentiment of the reviews: 64% positive, 22% negative and 14% neutral.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***