<a href="https://colab.research.google.com/github/Wolverine-Shiva/Play-Store-App-Review-Analysis-EDA/blob/main/Play_Store_App_Review_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Play Store App Review Analysis



##### **Project Type**    - EDA
##### **Contribution**    - Team
##### **Team Member 1 -** Shiva Singh
##### **Team Member 2 -**
##### **Team Member 3 -**
##### **Team Member 4 -**

# **Project Summary -**

The Play Store App Review Analysis EDA Project is a comprehensive examination of user reviews from the Google Play Store. Its primary goal is to gain valuable insights into user sentiments, uncover common issues, and provide recommendations for app developers to enhance their applications. This project involves various stages, including data collection, preprocessing, exploratory data analysis (EDA), visualization, and the extraction of actionable insights.

**Data Collection:**

The project begins by obtaining a dataset of app reviews from the Play Store. This dataset typically includes key information such as review text, user ratings, dates, and potentially other metadata. Data collection methods can involve web scraping tools or access to publicly available datasets.

**Data Preprocessing:**

Data preprocessing is a crucial step in ensuring the quality of the dataset. This involves cleaning the data by removing duplicates, addressing missing values, and converting data types when necessary. Additionally, text preprocessing techniques are applied to the review text, including tokenization, the removal of special characters and stopwords, and lemmatization or stemming.

**Exploratory Data Analysis (EDA):**

EDA is the heart of the project, where various analytical techniques are applied to extract insights from the dataset. This includes calculating descriptive statistics such as mean, median, and standard deviation for ratings. Distribution analysis is conducted through visualizations like histograms and box plots to assess overall user sentiment. Time analysis helps identify trends and seasonality in reviews over time. Word clouds are generated to visualize the most common words in positive and negative reviews, while sentiment analysis techniques classify reviews as positive, negative, or neutral. Topic modeling is also employed to uncover common themes or topics within the reviews.

**Visualization:**

Effective data visualization is employed to communicate findings clearly. Word clouds visualize frequently mentioned words in positive and negative reviews. Bar charts or pie charts display the distribution of sentiments (positive, negative, neutral), providing a quick overview of user sentiment trends. Line charts track changes in app ratings and sentiments over time, helping to identify patterns and fluctuations.

***Conclusion:***

In conclusion, the Play Store App Review Analysis EDA Project is a data-driven initiative aimed at helping app developers better understand user sentiments and improve their applications. The project encompasses data collection, preprocessing, exploratory data analysis, visualization, and the extraction of actionable insights. The recommendations and findings are essential for developers to enhance their apps and cater to user needs effectively.

# **GitHub Link -**

[GitHub Link](https://github.com/Wolverine-Shiva/Play-Store-App-Review-Analysis-EDA)

# **Problem Statement**


Explore and analyze the data to discover key factors responsible for app engagement and success.

#### **Define Your Business Objective?**

The objective of this project is to help app developers and businesses understand what factors make their apps successful. By analyzing the play_store_data and user_review_data datasets, we will identify relevant KPIs. This information will be used to provide insights and recommendations on how to improve app engagement, retain users, increase revenue, and enhance marketing strategies. Ultimately, the goal is to help businesses create successful apps that satisfy their customers and drive growth.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')

### **1st Step: Data Cleaning on Play Store dataset**

In [None]:
play_store_data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Capstone Projects/Play Store App Review/Play Store Data.csv')
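
Since the guidelines ask for exception handling, the load step could be wrapped in a small helper. The following is only a minimal sketch: `load_csv` is a hypothetical helper that is not part of the original notebook, and the demo reads a throwaway temporary file with made-up content instead of the Drive path.

```python
import os
import tempfile

import pandas as pd


def load_csv(path):
    """Read a CSV into a DataFrame, failing early with a readable message."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Dataset not found at: {path}")
    df = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"Dataset at {path} contains no rows")
    return df


# Demo on a throwaway file (illustrative data only, not the real dataset)
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.csv")
    with open(demo_path, "w") as handle:
        handle.write("App,Rating\nDemo App,4.5\n")
    demo_df = load_csv(demo_path)

print(demo_df.shape)
```

In the notebook, the same helper could be called with the Drive path used above, so a wrong path fails with a clear message instead of a long traceback.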

### Dataset First View

In [None]:
# Dataset first look: showing the first 10 rows of play_store_data
play_store_data.head(10)

In [None]:
# using tail() to show the last 10 rows of play_store_data
play_store_data.tail(10)

### Dataset Rows & Columns count

In [None]:
# counting the rows & columns of play_store_data
play_store_data.shape

### Dataset Information

In [None]:
# looking at the info (dtypes and non-null counts) of play_store_data
play_store_data.info()

#### Duplicate Values

In [None]:
# counting the fully duplicated rows of play_store_data
play_store_data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# counting the missing values/null values of play_store_data
print(play_store_data.isnull().sum())

In [None]:
# now visualizing the missing values of play_store_data using seaborn heatmap
sns.heatmap(play_store_data.isnull(), cbar=False)

### What did you know about your dataset?

As the output above shows, the dataset has:


1.   1474 missing values in the Rating column
2.   1 missing value in the Type column
3.   1 missing value in the Content Rating column
4.   8 missing values in the Current Ver column
5.   3 missing values in the Android Ver column

Rows with missing values in these columns could distort any later analysis or modeling, so they need to be handled appropriately.
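
The counts above can also be collected into a small table with percentages, which makes it easier to judge how much of each column is affected. This is a sketch under assumptions: `missing_summary` is a hypothetical helper, and the toy frame below uses made-up values in place of `play_store_data`.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return per-column null counts and percentages, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_pct": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have missing values
    summary = summary[summary["null_count"] > 0]
    return summary.sort_values("null_count", ascending=False)


# Toy frame standing in for play_store_data (values are made up)
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, np.nan, 3.9],
    "Type": ["Free", None, "Paid", "Free"],
})
report = missing_summary(toy)
print(report)
```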

## ***2. Understanding Your Variables***

In [None]:
# loading the dataset columns of play_store_data
play_store_data.columns

The dataset has 13 columns, described below:

1. **App** - Title of the app.

2. **Category** - Category of the app.

3. **Rating** - Rating given by the users.

4. **Reviews** - Number of reviews given by the users.

5. **Size** - The size of the application.

6. **Installs** - Number of installs/downloads of the app.

7. **Type** - Whether the application is free or paid.

8. **Price** - The price of the app on Play Store.

9. **Content Rating** - The age group the app is suitable for.

10. **Genres** - Genre tags of the app, in addition to its main category.

11. **Last Updated** - When the application was last updated.

12. **Current Ver** - Current version of the application.

13. **Android Ver** - Android version(s) supported by the application.

In [None]:
# describing the play_store_data
play_store_data.describe()

### Variables Description

The output above is the statistical summary of the Rating column.

* **Count:** the number of non-null values is 9367
* **Mean:** the average value of the column is 4.193338
* **Std:** the standard deviation of the column is 0.537431
* **Min:** the minimum value in the column is 1.000000
* **25%:** the first quartile value (25th percentile) of the column is 4.000000
* **50%:** the median value (50th percentile) of the column is 4.300000
* **75%:** the third quartile value (75th percentile) of the column is 4.500000
* **Max:** the maximum value in the column is 19.000000, which is above the 5-point rating scale and points to a bad row

### Check Unique Values for each variable.

In [None]:
# Now checking the unique values for each variable in play_store_data
for i in play_store_data.columns.tolist():
  print("Unique values in",i,"is",play_store_data[i].nunique())

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# creating a summary dataframe indexed by the column names of play_store_data
play_store_data_new = pd.DataFrame(index=play_store_data.columns)

# now adding datatype, not_null, and null columns to play_store_data_new
play_store_data_new["DataType"] = play_store_data.dtypes
play_store_data_new["not_null"] = play_store_data.count()
play_store_data_new["null"] = play_store_data.isnull().sum()

# printing the dataframe
play_store_data_new

In [None]:
# box plot of the numeric columns to spot outliers in Rating
play_store_data.boxplot()

In [None]:
# finding the rows whose Rating is outside the valid 1 to 5 range
play_store_data[(play_store_data['Rating'] < 1) | (play_store_data['Rating'] > 5)]

**There are problems with this row**

*   The app rating is 19, which is outside the valid 1 to 5 range
*   NaN values are present in Content Rating and Android Ver

That is why we're dropping this row.
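
Dropping by a fixed index works for this file, but it breaks if the row order ever changes. One alternative, shown here only as a sketch and not as what the notebook does, is to filter on the valid rating range. `drop_invalid_ratings` is a hypothetical helper and the toy frame is made up; rows with a missing rating are kept so they can be imputed later.

```python
import numpy as np
import pandas as pd


def drop_invalid_ratings(df, low=1.0, high=5.0):
    """Drop rows whose Rating is outside [low, high]; keep missing ratings."""
    rating = df["Rating"]
    # between() is inclusive on both ends and returns False for NaN,
    # so missing ratings are allowed through explicitly
    valid = rating.isnull() | rating.between(low, high)
    return df[valid].copy()


# Toy frame standing in for play_store_data (values are made up)
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.2, 19.0, np.nan],
})
cleaned = drop_invalid_ratings(toy)
print(cleaned)
```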

In [None]:
# now we're removing the above row from play_store_data
play_store_data.drop(10472,axis=0,inplace=True)

In [None]:
# checking whether the outlier row has been removed
play_store_data.boxplot();

The dataset still has 1474 null values in "Rating", so we need to fill them. To choose a fill value, we first compute the mean and median.

In [None]:
# finding the mean of Rating column from play_store_data
rating_mean = play_store_data['Rating'].mean()
print(f"The mean of Rating column is {rating_mean}")

# now finding the median of Rating column from play_store_data
rating_median = play_store_data['Rating'].median()
print(f"The median of Rating column is {rating_median}")


*   The mean of the Rating column is about 4.2
*   The median of the Rating column is 4.3

The mean and median are close, so either would be a reasonable fill value. We replace the null values with the median because it is less sensitive to outliers and skew than the mean; by definition, at least half of the rated apps have a rating of 4.3 or higher.
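
The effect of median imputation can be checked on a toy series before applying it to the real column. The values below are made up for illustration; the point is that filling with the median leaves the median unchanged while pulling the mean slightly towards it.

```python
import numpy as np
import pandas as pd

# Toy ratings with two missing values (illustrative only)
ratings = pd.Series([4.0, 4.5, np.nan, 4.3, np.nan, 3.0], name="Rating")

median_before = ratings.median()   # NaN values are skipped by default
mean_before = ratings.mean()

filled = ratings.fillna(median_before)

print("median before / after:", median_before, filled.median())
print("mean before / after:  ", mean_before, filled.mean())
print("remaining nulls:", filled.isnull().sum())
```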

In [None]:
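# filling all the null values in the Rating column of play_store_data with the median
# (assigning the result back avoids the chained inplace fillna that newer pandas versions warn about)
play_store_data['Rating'] = play_store_data['Rating'].fillna(rating_median)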

In [None]:
# checking whether the null values have been filled
print(play_store_data.isnull().sum())

Some problems still remain to be fixed:

*   Type has 1 NaN value.
*   Current Ver contains 8 NaN values.
*   Android Ver contains 2 NaN values.

In [None]:
# inspecting the row where Type is NaN in play_store_data
play_store_data[(play_store_data['Type'].isnull())]


For this row the Price is 0, so it is a free app; we therefore replace the NaN in Type with Free.
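
The same rule (a price of 0 means a free app) can be written without referring to a specific row index. This is a sketch under assumptions: `fill_type_from_price` is a hypothetical helper, Price is assumed to still be a raw string such as `'0'` or `'$4.99'`, and the toy frame is made up.

```python
import numpy as np
import pandas as pd


def fill_type_from_price(df):
    """Fill missing Type as 'Free' when Price is zero, otherwise 'Paid'."""
    out = df.copy()
    # Strip the dollar sign so the raw price strings can be parsed as numbers
    price = pd.to_numeric(
        out["Price"].astype(str).str.replace("$", "", regex=False),
        errors="coerce",
    )
    inferred = pd.Series(np.where(price > 0, "Paid", "Free"), index=out.index)
    # Only touch rows where Type is missing and the price could be parsed
    mask = out["Type"].isnull() & price.notnull()
    out.loc[mask, "Type"] = inferred[mask]
    return out


# Toy frame standing in for play_store_data (values are made up)
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Type": ["Free", np.nan, "Paid"],
    "Price": ["0", "0", "$4.99"],
})
fixed = fill_type_from_price(toy)
print(fixed)
```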

In [None]:
# counting the free and paid versions of applications in play_store_data
play_store_data['Type'].value_counts()

In [None]:
# now we replace the NaN value of Type
play_store_data.loc[9148,'Type']='Free'

In [None]:
# checking whether the NaN in Type has been fixed
play_store_data[(play_store_data['Type'].isnull())]

In [None]:
# inspecting the rows where Current Ver is null in play_store_data
play_store_data[(play_store_data['Current Ver'].isnull())]


As shown above, Current Ver has NaN values; in this case we'll drop those rows.

In [None]:
# dropping all the rows with NaN in the Current Ver column of play_store_data
play_store_data.drop([15,1553,6322,6803,7333,7407,7730,10342],axis=0,inplace=True)

In [None]:
# checking whether all the NaN values have been dropped
play_store_data[(play_store_data['Current Ver'].isnull())]

In [None]:
# inspecting the Android Ver NaN values reported earlier by print(play_store_data.isnull().sum())
play_store_data[(play_store_data['Android Ver'].isnull())]

Android Ver also has NaN values, so we'll drop those rows as well.
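
Instead of listing row indices by hand, both version columns can be cleaned in one step with `dropna(subset=...)`. The snippet below is a sketch on a made-up toy frame, not the notebook's own code; it also reports how many rows were removed.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for play_store_data (values are made up)
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Current Ver": ["1.0", np.nan, "2.1", "3.0"],
    "Android Ver": ["4.0 and up", "4.1 and up", np.nan, "5.0 and up"],
})

rows_before = len(toy)
# Drop a row if either of the two version columns is missing
toy = toy.dropna(subset=["Current Ver", "Android Ver"])
rows_dropped = rows_before - len(toy)

print(f"Dropped {rows_dropped} rows")
```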

In [None]:
# dropping the rows with NaN in the Android Ver column of play_store_data
play_store_data.drop([4453,4490],axis=0,inplace=True)

In [None]:
# checking whether these NaN values have been dropped
play_store_data[(play_store_data['Android Ver'].isnull())]

### **Time to Handle the Data Types of Every Column**

In [None]:
# inspecting the values and data type of the Size column in play_store_data
play_store_data['Size']


The "Size" column uses different units.

*   M means MB
*   k means KB

So we'll fix this by converting them into a single unit (MB).

In [None]:
# inspecting the distinct values of the Price column in play_store_data
play_store_data['Price'].value_counts()

The $ sign may be a problem in "Price", so we'll remove it.

In [None]:
# inspecting the distinct values of the Installs column in play_store_data
play_store_data['Installs'].value_counts()

In "Installs" we have to remove these characters: + (plus) and , (comma).

In [None]:
# removing + (plus) and , (comma) from Installs
def remove_from_install(a):
    if isinstance(a, str):
        a = a.replace(',', '').replace('+', '')
    return a

play_store_data['Installs'] = play_store_data['Installs'].apply(remove_from_install)

In [None]:
# removing the dollar sign from Price
def remove_from_price(b):
    if isinstance(b, str):
        b = b.replace('$', '')
    return b

play_store_data['Price'] = play_store_data['Price'].apply(remove_from_price)

In [None]:
# converting 'Reviews' to numeric in play_store_data
play_store_data['Reviews'] = play_store_data['Reviews'].astype(float)

# cleaning and converting 'Size' to a numeric value in MB
def clean_size(x):
    text = str(x)
    if 'Varies with device' in text:
        return np.nan                                 # size is unknown for these apps
    elif 'k' in text:
        return float(text.replace('k', '')) / 1024    # KB -> MB
    else:
        return float(text.replace('M', ''))           # already in MB


play_store_data['Size'] = play_store_data['Size'].apply(clean_size)

# cleaning and converting 'Installs' to numeric
# (raw strings keep \d from being treated as an invalid escape sequence)
play_store_data['Installs'] = play_store_data['Installs'].replace(r'[^\d]', '', regex=True).astype(float)


# cleaning and converting 'Price' to numeric
play_store_data['Price'] = play_store_data['Price'].replace(r'[^\d.]', '', regex=True).astype(float)

# printing the updated column data types
play_store_data.info()

### **Time to Remove Duplicate Apps**

In [None]:
# using head() to view the first 10 rows after the type conversions
play_store_data.head(10)

In [None]:
# checking the values of App from play_store_data
play_store_data['App'].value_counts()

In [None]:
# calculating the duplicate value in "App" column in play_store_data
play_store_data['App'].duplicated().sum()

There are 1181 duplicate values in the App column, which we'll remove.
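
`drop_duplicates` keeps the first occurrence of each app by default. One possible alternative, which is an assumption on our part and not what the notebook does, is to keep the entry with the highest review count for each app, on the reasoning that it is likely the most recent snapshot. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy frame standing in for play_store_data (values are made up)
toy = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Reviews": [100.0, 250.0, 40.0],
})

deduped = (
    toy.sort_values("Reviews", ascending=False)   # highest review count first
       .drop_duplicates(subset="App", keep="first")
       .sort_index()                              # restore the original row order
)
print(deduped)
```

Whether this is preferable to keeping the first occurrence depends on how the duplicates arose, which the dataset itself does not document.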

In [None]:
# dropping the duplicate values in the "App" column of play_store_data
play_store_data.drop_duplicates(subset='App',inplace=True)

In [None]:
# checking whether the duplicates have been removed
play_store_data['App'].duplicated().sum()

## **Summary**

*   All duplicate apps were removed from the dataset.
*   All null values found in the raw data were removed or replaced.
*   The data types of the relevant columns were converted and all unwanted characters were removed.

### **2nd Step: Data Cleaning on User Review dataset**

In [None]:
# now we're importing the User Reviews.csv from Google Drive as user_reviews_data
user_reviews_data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Capstone Projects/Play Store App Review/User Reviews.csv')

In [None]:
# using head() to show top 10 rows of user_reviews_data
user_reviews_data.head(10)

In [None]:
# using tail() to show last 10 rows of user_reviews_data
user_reviews_data.tail(10)

In [None]:
# checking the shape (number of rows and columns) of user_reviews_data
user_reviews_data.shape

In [None]:
# checking the columns of user_reviews_data
user_reviews_data.columns

The dataset has 5 columns, described below:


1.   **App:** Title of the application.
2.   **Translated_Review:** It contains the English translation of the review.
3.   **Sentiment:** It gives the emotion of the review: ‘Positive’, ‘Negative’, or ‘Neutral’.
4.   **Sentiment_Polarity:** It gives the polarity of the review. Its range is [-1, 1], where 1 means a ‘Positive statement’ and -1 means a ‘Negative statement’.
5.   **Sentiment_Subjectivity:** It measures how much the review expresses personal opinion rather than factual information. Its range is [0, 1], where 0 is fully objective and 1 is fully subjective.
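
The documented ranges can be turned into a quick sanity check on the two numeric sentiment columns. This is only a sketch: `out_of_range_rows` is a hypothetical helper and the toy frame is made up. Null rows are deliberately not flagged, since they are handled separately.

```python
import pandas as pd


def out_of_range_rows(df):
    """Return rows whose polarity or subjectivity is outside the documented range."""
    polarity_ok = df["Sentiment_Polarity"].between(-1, 1)
    subjectivity_ok = df["Sentiment_Subjectivity"].between(0, 1)
    # Rows with nulls are excluded here; they are dealt with in a separate step
    has_values = (
        df[["Sentiment_Polarity", "Sentiment_Subjectivity"]].notnull().all(axis=1)
    )
    return df[has_values & ~(polarity_ok & subjectivity_ok)]


# Toy frame standing in for user_reviews_data (values are made up)
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Sentiment_Polarity": [0.5, -1.2, None],
    "Sentiment_Subjectivity": [0.3, 0.4, None],
})
bad = out_of_range_rows(toy)
print(bad)
```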

In [None]:
# using the info of user_reviews_data
user_reviews_data.info()

In [None]:
# creating a summary dataframe indexed by the column names of user_reviews_data
user_reviews_data_new = pd.DataFrame(index=user_reviews_data.columns)

# now adding datatype, not_null, and null columns to user_reviews_data_new
user_reviews_data_new["DataType"] = user_reviews_data.dtypes
user_reviews_data_new["not_null"] = user_reviews_data.count()
user_reviews_data_new["null"] = user_reviews_data.isnull().sum()

# printing the dataframe
user_reviews_data_new

In [None]:
# finding the NaN value of Sentiment_Polarity column in user_reviews_data
user_reviews_data[(user_reviews_data['Sentiment_Polarity'].isnull())]

In [None]:
# counting the values of Sentiment_Polarity from user_reviews_data
user_reviews_data['Sentiment_Polarity'].value_counts()

The rows with a null "Sentiment_Polarity" generally have null values in the other review columns as well, so there is nothing to impute them from. We therefore drop all rows containing null values from the dataset.
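
Before dropping, it is worth confirming that the nulls really do occur together across the review columns. The check below is a sketch on a made-up toy frame that reuses the column names of `user_reviews_data`; on the real data, a result of `False` would mean some rows are only partly empty and might deserve a closer look.

```python
import numpy as np
import pandas as pd

review_cols = [
    "Translated_Review",
    "Sentiment",
    "Sentiment_Polarity",
    "Sentiment_Subjectivity",
]

# Toy frame standing in for user_reviews_data (values are made up)
toy = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Translated_Review": ["Great app", np.nan, "Crashes often"],
    "Sentiment": ["Positive", np.nan, "Negative"],
    "Sentiment_Polarity": [0.8, np.nan, -0.4],
    "Sentiment_Subjectivity": [0.75, np.nan, 0.6],
})

# Rows where polarity is missing: are all the other review columns missing too?
null_polarity = toy[toy["Sentiment_Polarity"].isnull()]
all_review_cols_null = bool(null_polarity[review_cols].isnull().all(axis=1).all())
print("Nulls occur together:", all_review_cols_null)

cleaned = toy.dropna()
print("Rows kept:", len(cleaned))
```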

In [None]:
# dropping all rows with null values from user_reviews_data
user_reviews_data.dropna(inplace = True)

In [None]:
# checking whether any null values remain
user_reviews_data[(user_reviews_data['Sentiment_Polarity'].isnull())]

In [None]:
# viewing the first 20 rows to confirm the data has been cleaned
user_reviews_data.head(20)

**This is perfect for now**

### What all manipulations have you done and insights you found?

Answer Here.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 2

In [None]:
# Chart - 2 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 3

In [None]:
# Chart - 3 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 4

In [None]:
# Chart - 4 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 5

In [None]:
# Chart - 5 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 6

In [None]:
# Chart - 6 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 7

In [None]:
# Chart - 7 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 8

In [None]:
# Chart - 8 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 9

In [None]:
# Chart - 9 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 10

In [None]:
# Chart - 10 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 11

In [None]:
# Chart - 11 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 12

In [None]:
# Chart - 12 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 13

In [None]:
# Chart - 13 visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

##### 1. Why did you pick the specific chart?

Answer Here.

##### 2. What is/are the insight(s) found from the chart?

Answer Here

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here.

# **Conclusion**

Write the conclusion here.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***