<a href="https://colab.research.google.com/github/aviralsingh2907/PlayStore-App-Review-Analysis/blob/main/EDA_Play_Store_App_Review_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <b><u>Exploratory Data Analysis</u> - <u>Play Store App Review Analysis</u></b>

##### **Project Type**    - EDA
##### **Contribution**    - Individual
##### **Team Member**     - Aviral Singh

# <b><u>Project Summary</u>:</b>

# **<u>GitHub Link</u>:**

# <b><u>Problem Statement</u>:</b>

The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market.

Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps.

Explore and analyze the data to discover key factors responsible for app engagement and success.

# <b><u>Let's Begin</u>:</b>

## <b><u>Dataset Import</u>:</b>

In [None]:
# Importing package:

import pandas as pd
import numpy as np
from google.colab import drive
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns


drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Importing Play Store dataset:

playstore_df_main = pd.read_csv('/content/drive/MyDrive/Portfolio Project/EDA - Play Store App Review/Play Store Data.csv')
user_review_df_main = pd.read_csv('/content/drive/MyDrive/Portfolio Project/EDA - Play Store App Review/User Reviews.csv')

In [None]:
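# Creating working copies so the raw frames stay untouched.
# Later cells refer to these copies as play_store_df and user_review,
# so both spellings are bound to the same objects here.

playstore_df = play_store_df = playstore_df_main.copy()
user_review_df = user_review = user_review_df_main.copy()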

In [None]:
playstore_df.head(5)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [None]:
user_review_df.head(5)

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


In [None]:
# Checking the columns present in the datasets:

playstore_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [None]:
user_review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     64295 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37432 non-null  object 
 3   Sentiment_Polarity      37432 non-null  float64
 4   Sentiment_Subjectivity  37432 non-null  float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB


In [None]:
# Null values check:

playstore_df.isnull().sum()

In [None]:
user_review_df.isnull().sum()

## <b><u>Data Cleaning</u>:</b>

Data cleaning means correcting and converting raw data before processing and analysis. It is a crucial phase that often involves reformatting data, correcting values, and combining datasets to enrich them. In practice it means finding the parts of the data that are incomplete, incorrect, inaccurate, or irrelevant, and then replacing, modifying, or deleting them.

A small helper function that summarizes each column (data type, null count, unique count) gives us more useful information about the dataset attributes. Defining it as a function also makes it reusable, and we will call it often in the following steps.

In [None]:
# Defining the function:

def complete_info():
    null = pd.DataFrame(index = play_store_df.columns)
    null['data_type'] = play_store_df.dtypes
    null['null_count'] = play_store_df.isnull().sum()
    null['unique_count'] = play_store_df.nunique()
    return null

In [None]:
# Calling the function:

complete_info()
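
The helper above reads the global frame `play_store_df`. A variant that takes the frame as an argument could be reused for the user-reviews data as well. The following is a minimal sketch of that idea; the name `summarize_columns` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, null count and unique count for every column of df."""
    return pd.DataFrame({
        "data_type": df.dtypes,
        "null_count": df.isnull().sum(),
        "unique_count": df.nunique(),   # nunique() ignores NaN by default
    })

# Tiny made-up frame, only to show the shape of the result.
toy_df = pd.DataFrame({"App": ["a", "b", "b"], "Rating": [4.1, None, 3.9]})
summary = summarize_columns(toy_df)
print(summary)
```

Passing the frame explicitly avoids a hidden dependency on one variable name, which matters in a notebook where cells can be run out of order.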

### <b><u>'App' column</u>:</b>

We can now begin the process of cleaning the data. Let's start with the column "App" and see if there are any duplicate values present.

In [None]:
print("Number of unique app names = ", len(play_store_df['App'].unique()))
print("Total number of app names = ", play_store_df.shape[0])
print("Duplicate apps = ", len(play_store_df['App']) - len(play_store_df['App'].unique()))

In [None]:
#Examining the appearance of the duplicate value
play_store_df[play_store_df['App'] == 'Coloring book moana']

In [None]:
play_store_df.drop_duplicates(subset = 'App', keep = 'first' ,inplace = True)

### <b><u>'Type' column</u>:</b>

In [None]:
# check for unique set of values in the column 'type'
play_store_df.Type.unique()

The column contains a '0' entry and a null value. Let's change both to 'Free'.

In [None]:
#changing the values to 'Free'
play_store_df['Type'] = play_store_df['Type'].replace('0', 'Free').fillna('Free')

### <b><u>'Content Rating' column</u>:</b>

In [None]:
#checking for null values
play_store_df[play_store_df['Content Rating'].isnull()]

In [None]:
#comparing the data with the nearby rows.
play_store_df.loc[10465:10477, :]

As the output above shows, row 10472 has no value in the Category column, so every value in that row is shifted one column to the left.
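
A shifted row like this can also be found programmatically instead of by inspecting neighbouring rows. The sketch below assumes that a genuine category name never parses as a number; the toy frame and its values are invented for illustration.

```python
import pandas as pd

# Made-up rows: the middle one imitates a record whose values slid one column left.
toy_df = pd.DataFrame({
    "App": ["Sketch", "Broken row", "Pixel Draw"],
    "Category": ["ART_AND_DESIGN", "1.9", "ART_AND_DESIGN"],
    "Rating": [4.5, 19.0, 4.3],
})

# errors="coerce" turns anything that is not a number into NaN, so only
# numeric-looking Category values survive the notna() filter.
category_as_number = pd.to_numeric(toy_df["Category"], errors="coerce")
shifted_rows = toy_df[category_as_number.notna()]
print(shifted_rows)
```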

In [None]:
#Dropping the row containing null values in the column 'Content Rating'
play_store_df.dropna(subset = ['Content Rating'], inplace=True)

### Rating column:

The Rating column, which has a total of 1463 missing values, can now be fixed. We replace the missing values with the mode of the column.

In [None]:
#Finding the mode value and using it to fill the null values.
modeValueRating = play_store_df['Rating'].mode()
print(f'The mode value is: {modeValueRating[0]}')
play_store_df['Rating'] = play_store_df['Rating'].fillna(modeValueRating[0])
complete_info()
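
To make the effect of mode imputation concrete, here is a small sketch on invented ratings. It also shows why `mode()` is indexed with `[0]`: the method returns a Series, because a column can have more than one mode.

```python
import pandas as pd

# Invented ratings with two missing entries.
ratings = pd.Series([4.4, 4.4, 3.9, None, 4.4, None], name="Rating")

mode_value = ratings.mode()[0]        # mode() returns a sorted Series; take the first entry
filled = ratings.fillna(mode_value)   # every NaN becomes the mode

print(mode_value)
print(filled.tolist())
```

Every filled value lands on the most frequent rating, so the peak of the rating distribution becomes taller after this step.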

We have a few unnecessary columns that won't be very helpful when we're doing the analysis. Let's eliminate those columns.

In [None]:
#Eliminating the columns that are not necessary.
play_store_df.drop(['Current Ver','Last Updated', 'Android Ver'], axis=1, inplace=True)

In [None]:
complete_info()

The null count for every column is zero, indicating that there are no longer any missing entries in the data frame.

## Data Preparation:

Columns such as Reviews, Size, Installs, and Price should be numeric (int or float), but they are currently stored as object. Let's convert them to the appropriate types.

### Reviews

In [None]:
#converting the column Reviews to type 'int'
play_store_df['Reviews'] = play_store_df.Reviews.astype(int)

### Size

The 'Size' column stores its values as strings: most end in 'M' (megabytes) or 'k' (kilobytes), and some read "Varies with device". Stripping the suffixes alone would mix the two scales, so we convert the kilobyte values to megabytes. The cells below do this step by step.

In [None]:
#Removing the +Symbol:
play_store_df['Size'] = play_store_df.Size.apply(lambda x: x.strip('+'))

In [None]:
#Removing the , symbol:
play_store_df['Size'] = play_store_df.Size.apply(lambda x: x.replace(',', ''))

In [None]:
#removing M and k from values, also converting KB into MB
play_store_df['Size'] =play_store_df['Size'].apply(lambda x: x.replace('M', '') if 'M' in str(x) else x )
play_store_df['Size'] = play_store_df['Size'].apply(lambda x: float(x.replace('k', ''))/1024 if 'k' in str(x) else x)

In [None]:
#Replacing the "Varies with device" value with NaN (np.NaN was removed in NumPy 2.0, so np.nan is used):
play_store_df['Size'] = play_store_df.Size.replace('Varies with device', np.nan)

Converting "Varies with device" to NaN leaves a set of missing sizes to deal with. App sizes range from very small to very large, so a mean or mode would be a poor stand-in; it is better to drop the rows whose Size is NaN than to impute them.

In [None]:
# Removing the rows that contained "Varies with device"
play_store_df.dropna(subset = ['Size'], inplace=True)

In [None]:
#Renaming the column with an appropriate name.
play_store_df.rename(columns={'Size': 'Size(in MB)'}, inplace=True)

In [None]:
#Now, finally converting all these values to numeric type:
play_store_df['Size(in MB)'] = pd.to_numeric(play_store_df['Size(in MB)'])
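
The size-cleaning steps above can also be folded into a single parsing function. This is a sketch under the assumption that every size is either `<number>M`, `<number>k`, or a non-numeric placeholder; the function name `size_to_mb` and the sample values are illustrative.

```python
import numpy as np
import pandas as pd

def size_to_mb(raw):
    """Convert a size string such as '19M' or '512k' to megabytes (NaN if unknown)."""
    text = str(raw).strip()
    if text.endswith("M"):
        return float(text[:-1])
    if text.endswith("k"):
        return float(text[:-1]) / 1024   # same kB -> MB factor as in the cells above
    return np.nan                        # e.g. "Varies with device"

sizes = pd.Series(["19M", "8.7M", "512k", "Varies with device"])
size_in_mb = sizes.apply(size_to_mb)
print(size_in_mb.tolist())
```

Because the function always returns a float or NaN, the resulting column is numeric straight away and no separate `pd.to_numeric` call is needed.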

### Installs

In [None]:
#check for unique values
play_store_df.Installs.unique()

We now convert this column to a numeric type. The thousands separators (",") and the trailing "+" sign must be removed first.

In [None]:
#Removing the "+" sign and the "," separators
play_store_df['Installs'] = play_store_df.Installs.apply(lambda x: x.strip('+'))
play_store_df['Installs'] = play_store_df.Installs.apply(lambda x: x.replace(',', ''))

In [None]:
#converting it from string type to numeric type
play_store_df['Installs'] = pd.to_numeric(play_store_df['Installs'])
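
The same conversion can be written with pandas' vectorized string methods instead of two `apply` calls. A minimal sketch on invented values:

```python
import pandas as pd

installs = pd.Series(["10,000+", "500,000+", "5,000,000+", "0"])

# Remove every "+" and "," in one regular-expression pass, then convert.
installs_numeric = pd.to_numeric(installs.str.replace(r"[+,]", "", regex=True))
print(installs_numeric.tolist())
```

Note that the result is a lower bound: "10,000+" means at least 10,000 installs, so the numeric column records install brackets rather than exact counts.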

### Price

In [None]:
#checking for value count
play_store_df['Price'].value_counts()

In [None]:
#Removing "$" sign
play_store_df['Price'] = play_store_df.Price.apply(lambda x: x.strip('$'))

In [None]:
# converting to Numeric type
play_store_df['Price'] = pd.to_numeric(play_store_df['Price'])

Let's take a final look at our DataFrame.

In [None]:
#calling the function
complete_info()

In [None]:
play_store_df.shape

After cleaning, that is, after removing the unneeded columns and the rows with null or garbage values, the data frame has 8434 rows and 10 columns.

## Summarizing the dataset by category:

* Total size occupied by each category of apps.
* Average rating for each category of apps.
* Total installs for each category of apps.
* Total reviews for each category of apps.

In [None]:
categorically_summerization= play_store_df.groupby('Category').agg({'Size(in MB)':'sum', 'Rating':'mean', 'Installs':'sum','Reviews':'sum'})
categorically_summerization
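
The dictionary form of `agg` keeps the original column names, so the result does not say which statistic each column holds. pandas' named aggregation makes that explicit. The sketch below uses an invented three-row frame, and the output column names are illustrative.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS"],
    "Size(in MB)": [50.0, 30.0, 5.0],
    "Rating": [4.5, 4.1, 4.0],
    "Installs": [1000, 500, 100],
    "Reviews": [200, 50, 10],
})

# Each keyword is an output column: (input column, aggregation function).
summary = toy_df.groupby("Category").agg(
    total_size_mb=("Size(in MB)", "sum"),
    mean_rating=("Rating", "mean"),
    total_installs=("Installs", "sum"),
    total_reviews=("Reviews", "sum"),
)
print(summary)
```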

## Data Analysis & Visualization

### Category

Let us find the top categories in the Play Store, that is, the ones that contain the most apps.

In [None]:
#unique categories
len(play_store_df['Category'].unique())

The dataset contains 33 categories. Let's see which of them have the most apps.

In [None]:
# Determining top categories in data
category_counts = play_store_df['Category'].value_counts()
xaxis = category_counts.index.tolist()   # category names, most frequent first
yaxis = category_counts.values.tolist()  # number of apps in each category

In [None]:
# Plotting graph/visuals for the same
plt.figure(figsize=(18,5))
plt.xlabel("Category", fontsize = 15)
plt.ylabel("Count", fontsize = 15)
plt.xticks(rotation=90)
category_graph = sns.barplot(x = xaxis, y = yaxis, palette= "rainbow")
category_graph.set_title("Categories of apps in google Playstore", fontsize = 25);

The dataset has 33 categories. The plot shows that the Family and Game categories contain the most apps, while Beauty and Comics contain the fewest.

In [None]:
#Finding the top 10 categories
# rename_axis/reset_index(name=...) gives the same column names on pandas 1.x and 2.x
Top10_categories=play_store_df['Category'].value_counts().rename_axis('Category').reset_index(name='Count').head(10)
Top10_categories

In [None]:
#Plotting Distribution of top 10 categories
plt.figure(figsize=(8,10))
plt.pie(Top10_categories['Count'],labels=Top10_categories['Category'],autopct='%.0f%%',explode=[0.02]*10)
plt.title('Top 10 categories distribution', fontsize= 20)
plt.show()

Among the top 10 categories, Family (31%), Game (15%), and Tools (13%) contribute the most.
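
One detail worth keeping in mind: a pie chart built from the top 10 counts shows each category's share within those 10 categories, not its share of all apps. The sketch below illustrates the difference on invented counts, using a top 2 instead of a top 10 to keep it short.

```python
import pandas as pd

# Invented data: 6 FAMILY apps, 3 GAME apps, 1 TOOLS app.
categories = pd.Series(["FAMILY"] * 6 + ["GAME"] * 3 + ["TOOLS"] * 1)

counts = categories.value_counts()
top = counts.head(2)                              # keep only the two largest categories

share_within_top = top / top.sum() * 100          # what a pie of `top` displays
share_of_all = top / counts.sum() * 100           # share among every app

print(share_within_top.round(1).to_dict())
print(share_of_all.round(1).to_dict())
```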

### Rating

In [None]:
# plotting the distribution of Rating (histplot replaces the deprecated distplot)
plt.figure(figsize=(14,5))
sns.histplot(x=play_store_df['Rating'], kde=True, color = 'blue')
plt.grid()
plt.xlabel("Rating")
plt.ylabel("Frequency")
plt.title('Distribution Plot Of Rating')
plt.show()

In [None]:
#calculating the average rating
print('The average rating in the playstore is',play_store_df['Rating'].mean())

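The average rating, read together with the distribution plot, suggests that most apps in the Play Store are rated above 4, which indicates that the majority of available apps are well regarded. Note that the missing ratings were earlier filled with the mode, which adds weight to the peak of the distribution.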
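
A mean above 4 does not by itself say how many apps are rated above 4, so it can help to compute that share directly. A minimal sketch on invented ratings:

```python
import pandas as pd

ratings = pd.Series([4.5, 4.3, 4.7, 3.2, 4.1, 2.8, 4.4, 4.6])   # invented values

mean_rating = ratings.mean()
# The mean of a boolean Series is the fraction of True values.
share_above_4 = (ratings > 4).mean()

print(f"mean rating: {mean_rating:.3f}")
print(f"share of apps rated above 4: {share_above_4:.0%}")
```

On the real data the same two lines would be applied to `play_store_df['Rating']`.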

In [None]:
#Apps with a 5-star rating
five_star_rating_apps=play_store_df[play_store_df['Rating']==5]
five_star_rating_apps['App'].nunique()

In [None]:
#Top 10 categories of 5-star apps
five_star_rating_apps['Category'].value_counts().rename_axis('Category').reset_index(name='Count').head(10)

### Content Rating

Let us see which value of the 'Content Rating' column has the most apps on the Play Store.

In [None]:
#Content rating value counts
value_c=play_store_df["Content Rating"].value_counts().rename_axis("Content Rating").reset_index(name="Count")

In [None]:
#barplot of content rating value counts
sns.barplot(x="Count",y="Content Rating",data=value_c)
plt.title("Barplot of Content Rating ",fontsize=20)
plt.xlabel("No. of apps", fontsize= 15)
plt.ylabel("Content rating", fontsize= 15)

The Everyone category has the most apps, as can be seen from the plot above.

### Categories Of Type-'Free' And 'Paid'

Paid apps vs Free apps

In [None]:
#plotting a graph between free and paid apps
plt.figure(figsize=(8,8))
labels = play_store_df['Type'].value_counts(sort = True).index
sizes = play_store_df['Type'].value_counts(sort = True)
colors = ["orange","blue"]
explode = (0.2,0)
plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', shadow=True, startangle=0)
plt.title('Percentage of Free Vs Paid Apps in store',size = 20)
plt.show()

We can see from the graph above that 92% of the apps in the Google Play store are free, while 8% are paid.

### Installs

Let us check which category's apps have the most installs.

In [None]:
# defining x
x = play_store_df.groupby('Category')['Installs'].sum()

In [None]:
# plotting line graph to determine category highest installations
plt.figure(figsize=(18,8))
plt.plot(x ,  color='blue', marker='.')
plt.xticks(rotation=90)
plt.xlabel('Categories---->')
plt.ylabel('Installs---->')
plt.title('Category vs Installs')
plt.grid()
plt.show()

The graph shows that apps in the Game category have the most installs, followed by the Family category and then Travel & Local.

In [None]:
# top 10 installed apps in a given category
def top10incategory_installs(category):
    # category names are stored in upper case, e.g. 'GAME'
    category = category.upper()
    top10 = play_store_df[play_store_df['Category'] == category]
    top10apps = top10.sort_values(by='Installs', ascending=False).head(10)
    # Bar chart of the ten most-installed apps in the category
    plt.figure(figsize=(15,12))
    plt.title('Top 10 Installed Apps',size = 20);
    plt.xlabel('App', fontsize= 15)
    plt.ylabel('Installs',fontsize= 15)
    graph = sns.barplot(x = top10apps.App, y = top10apps.Installs)
    graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right');

After we are done with defining the function, it’s time to check and see if everything is working fine. So let’s test it by passing Game category to the above-defined function.

In [None]:
#calling the function
top10incategory_installs('Game')

Subway Surfers has the most installs in the Game category, as shown in the graph above. In the same way, we can obtain the top 10 installed apps of any other category by passing its name to the function.

### Price

Let us visualize the 10 most expensive apps in the Play Store.

In [None]:
# Separate data frame with the 10 most expensive paid apps.
top10PaidApps_df = play_store_df[play_store_df['Type'] == 'Paid'].sort_values(by='Price', ascending=False).head(10)

In [None]:
#plotting the installs of the top 10 most expensive apps
plt.figure(figsize=(12,9));
plt.pie(top10PaidApps_df.Installs, explode=None, labels=top10PaidApps_df.App, autopct='%1.1f%%', startangle= -50);
plt.title('Top Expensive Apps Distribution',size = 20);
plt.legend(top10PaidApps_df.App,
           loc="upper left",
           title="Apps",
           fontsize = "xx-small"
          );

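The data frame is sorted by price in descending order, so its first entries are the most expensive apps: 'I am rich Premium' is the most expensive app in the Google Play store, followed by 'I am Rich'. Note that the pie chart itself shows each app's share of installs among these apps, not their prices.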

In [None]:
#Description of the Price feature
print('The mean price of an App in playstore is',play_store_df['Price'].mean())
print('The maximum price of an App in playstore is',play_store_df['Price'].max())
play_store_df['Price'].describe()

### Size

In [None]:
# Plotting a density (KDE) curve to show the size distribution
plt.figure(figsize=(10,5))
plt.xlabel("Size(in MB)")
plt.ylabel("Density")
plt.grid()
# fill=True replaces the deprecated shade=True
size_distribution_graph = sns.kdeplot(x=play_store_df['Size(in MB)'], color="lightgreen", fill = True)
plt.title('Size distribution',size = 20);

In [None]:
print('The median size of an App in playstore is',play_store_df['Size(in MB)'].median())
print('The maximum size of an App in playstore is',play_store_df['Size(in MB)'].max())

In [None]:
# Plotting a boxplot graph to determine size distribution
plt.figure(figsize=(10,5))
plt.xlabel("Size(in MB)")
plt.grid()
# x= makes the box horizontal, with size on the x-axis
size_distribution_graph = sns.boxplot(x=play_store_df['Size(in MB)'], color="lightgreen")
plt.title('Size distribution (box plot)',size = 20);

As we can see from the box plot above:

* 25% of apps are under 5 MB.
* The middle 50% of apps lie between 5 MB and 30 MB.
* The remaining 25% are above 30 MB, with the upper whisker ending at about 63 MB.
* Many outliers extend all the way up to 100 MB.
* The median is 12 MB.
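
The numbers read off the box plot can also be computed directly. The sketch below uses invented sizes and the 1.5 × IQR rule, which is the default whisker rule in matplotlib and seaborn box plots.

```python
import pandas as pd

sizes_mb = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100], dtype=float)   # invented values

q1, median, q3 = sizes_mb.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr                  # points above this are drawn as outliers
outliers = sizes_mb[sizes_mb > upper_fence]

print(q1, median, q3, upper_fence)
print(outliers.tolist())
```

Applied to `play_store_df['Size(in MB)']`, the same code gives exact quartiles in place of values estimated by eye from the plot.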

### Reviews

Apps which are highly reviewed

In [None]:
Apps_and_reviews= play_store_df.groupby('App')[['Reviews']].mean().sort_values('Reviews', ascending=False).head(10).reset_index()
sns.barplot(y = Apps_and_reviews['App'], x = Apps_and_reviews['Reviews'])
plt.title('Apps with the most reviews')
plt.show()

In [None]:
print("Number of Apps with more than 1M reviews",play_store_df[play_store_df['Reviews'] > 1000000].shape[0])
print("\nTop 20 apps with most reviews: \n",play_store_df[play_store_df['Reviews'] > 1000000].sort_values(by = 'Reviews', ascending = False).head(20)['App'])

### Genres

In [None]:
#top 10 Genres Value Counts
genres_count=play_store_df['Genres'].value_counts().rename_axis('Genres').reset_index(name='count')
top_10_genres=genres_count.head(10)

In [None]:
#Pie plot of Top 10 genres Count
plt.rcParams['figure.figsize'] = (20, 10)
plt.pie(top_10_genres['count'],labels=top_10_genres['Genres'],autopct='%.0f%%')
plt.title('Pie plot of Top 10 genres Count')
plt.grid()
plt.show()

Tools is the most common genre among the apps.

### Correlation Heatmap

In [None]:
#Correlation
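# numeric_only=True (pandas 1.5+) skips the text columns, which would otherwise raise on pandas 2.x
play_store_df.corr(numeric_only=True)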

In [None]:
#Correlation Heatmap
plt.figure(figsize = (10,6))
sns.heatmap(play_store_df.corr(numeric_only=True), annot= True)
plt.title("Correlation Heatmap",fontsize=20)

In [None]:
#Regression Plot Of Installs And Reviews
plt.figure(figsize = (10,6))
sns.regplot(x="Installs",y="Reviews",data=play_store_df)
plt.title("Regression Plot of Installs And Reviews",fontsize=25)
plt.grid()

## User Reviews Data

In [None]:
#First Look Of User Reviews Data
user_review.head()

In [None]:
user_review.info()

### Handling missing values

In [None]:
#data of Translated null values
user_review[user_review['Translated_Review'].isnull()].head()

In [None]:
#Dropping Nulls of Translated Review as all other feature values are also null
user_review.dropna(subset=['Translated_Review'],inplace=True)

In [None]:
#Checking data after removing nulls
user_review.info()

### Combining Both Datasets

In [None]:
#unique apps in the user reviews data
user_review['App'].nunique()

In [None]:
#unique apps in the Play Store data
play_store_df['App'].nunique()

In [None]:
#Merging both data
combined_data=pd.merge(play_store_df,user_review, on='App')

In [None]:
#Unique Apps in combined data
combined_data['App'].nunique()

In [None]:
#About combined data
combined_data.isnull().sum()
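
`pd.merge` performs an inner join by default, so apps that appear in only one of the two frames drop out of the combined data. The sketch below shows this on two invented frames and uses `indicator=True` on an outer join to count what would be lost.

```python
import pandas as pd

apps = pd.DataFrame({"App": ["A", "B", "C"],
                     "Category": ["GAME", "TOOLS", "FAMILY"]})
reviews = pd.DataFrame({"App": ["A", "A", "C", "D"],
                        "Sentiment": ["Positive", "Negative", "Positive", "Neutral"]})

inner = pd.merge(apps, reviews, on="App")   # default how="inner"

# The outer join keeps unmatched rows; the _merge column records where each row came from.
outer = pd.merge(apps, reviews, on="App", how="outer", indicator=True)
match_counts = outer["_merge"].value_counts()

print(inner)
print(match_counts)
```

Here app B has no reviews and app D has no Play Store record, so neither appears in `inner`. This explains why the number of unique apps in the combined data can be smaller than in either input.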

## Analysis Of Combined Data

### Sentiment

In [None]:
#Sentiment count
sentiment_count= combined_data.Sentiment.value_counts().rename_axis('Sentiment').reset_index(name='count')
sentiment_count

In [None]:
#Pieplot of Sentiment Count
plt.rcParams['figure.figsize'] = (15, 7)
plt.pie(sentiment_count['count'],labels=sentiment_count['Sentiment'],autopct='%.0f%%',explode=(0,0.05,0.05))
plt.title('Pie plot of Sentiment Count',size=15)
plt.show()

This chart shows that most of the reviews are positive.

In [None]:
#Category-wise sentiment count
categoriwise_sentiment_count=combined_data.groupby(['Category','Sentiment'])['Sentiment'].count().reset_index(name='count')
categoriwise_sentiment_count.head()

In [None]:
#Barplot of categoriwise_sentiment_count
sns.barplot(x="Category", y="count", hue="Sentiment", data=categoriwise_sentiment_count)
plt.xticks(rotation=90, horizontalalignment="center")
plt.title("Category-wise sentiment count",fontsize=20)
plt.xlabel("Category", fontsize= 15)
plt.ylabel("Number of reviews", fontsize= 15)

This graph shows the sentiment of the reviews broken down by category. In every category, positive reviews outnumber negative ones.
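
Raw counts favour categories that simply have more reviews. To compare categories on an equal footing, the counts can be turned into within-category shares. A minimal sketch with `pd.crosstab` on invented rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "GAME", "GAME", "MEDICAL", "MEDICAL"],
    "Sentiment": ["Positive", "Positive", "Negative", "Neutral", "Positive", "Negative"],
})

# normalize="index" divides each row by its total, so every category sums to 1.
sentiment_share = pd.crosstab(toy["Category"], toy["Sentiment"], normalize="index")
print(sentiment_share)
```

On the real data the same call would take `combined_data['Category']` and `combined_data['Sentiment']`.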

# Conclusion

Joining the dots across the two datasets surfaced a lot of information. After analysis and visualization, the findings fall into two groups: those useful to customers and those useful to developers. We identified the categories with the most apps, the most installed categories, the proportion of free and paid apps, and the typical app size. This information helps customers and users make informed download decisions.

The Game category has the highest engagement

• If fast growth is the goal, introducing a good-quality game app of suitable size is a promising route.

Opportunities

• Categories such as Medical, Food & Drink, Health & Fitness, and Business & Finance receive favourable sentiment from users but contain relatively few apps, which creates opportunities for new entrants.