
## **Google Play Store Apps Exploratory Data Analysis (EDA)**

**Introduction**

Google Play Store is a digital distribution service developed and operated by Google. It is the official app store for Android and offers a variety of content such as apps, books, magazines, music, movies, and television programs. It serves as a platform that allows users with 'Google certified' Android devices to download applications published on the platform, either for a charge or free of cost. With the rapid growth of Android devices and apps, analysing this data can yield valuable insights.

<b> The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market. </b>



In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# importing libraries
import numpy as np
import pandas as pd

In [3]:
# load the Play Store and user review data into pandas DataFrames
store_df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/AlmaBetter/Capstone/EDA/Play Store Data.csv")
review_df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/AlmaBetter/Capstone/EDA/User Reviews.csv")

In [5]:
#look at first 5 records of Play Store data
store_df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [6]:
#look at a random record of Play Store data
store_df.sample()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
4658,Waiting For U Launcher Theme,PERSONALIZATION,4.4,5599,4.8M,"500,000+",Free,0,Everyone,Personalization,"July 23, 2016",2.0.40,2.3 and up


**Description of App Dataset columns**

App : The name of the app

Category : The category of the app

Rating : The average rating of the app out of 5, in the Play Store

Reviews : The number of reviews of the app

Size : The size of the app

Installs : The total number of installs of the app

Type : The type of the app (Free/Paid)

Price : The price of the app (0 if it is Free)

Content Rating : The appropriate target audience of the app

Genres: The genre of the app

Last Updated : The date when the app was last updated

Current Ver : The current version of the app

Android Ver : The minimum Android version required to run the app

In [8]:
#getting basic info about play store data
store_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [9]:
# some columns have missing values; count the nulls in each column
store_df.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

**Data Preparation and Cleaning**

Data preparation is the process of cleaning and transforming raw data before analysis. It often involves reformatting data, correcting errors, and combining data sets to enrich them. Data cleaning, in particular, means detecting incomplete, incorrect, inaccurate, or irrelevant records and then replacing, modifying, or deleting them.

From the information above, we see that five columns contain null (missing) values: Rating, Type, Content Rating, Current Ver, and Android Ver. Rating accounts for most of them, with 1,474 missing values out of 10,841 rows.
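One possible way to handle these gaps is sketched below. It is an illustration on a tiny made-up frame, not the treatment this notebook applies: the sample values, the `toy_df` name, and the choice of median imputation for `Rating` are all assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the Play Store table (values are made up).
toy_df = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.0, np.nan, 3.0, 5.0],
    "Type": ["Free", "Paid", None, "Free"],
})

# Share of missing values per column, as a percentage.
missing_pct = toy_df.isnull().mean() * 100

# Assumption: impute the numeric Rating column with its median ...
cleaned = toy_df.copy()
cleaned["Rating"] = cleaned["Rating"].fillna(cleaned["Rating"].median())

# ... and drop the few rows that lack a categorical value such as Type.
cleaned = cleaned.dropna(subset=["Type"])

print(missing_pct)
print(cleaned)
```

Whether imputing or dropping is appropriate depends on the question being asked; for a column with as many gaps as Rating, dropping every affected row would discard a sizeable part of the data.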

In [None]:
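# `df` is not defined earlier in this notebook; assuming it is a working copy of the Play Store data
df = store_df.copy()
df.columns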

In [None]:
# `df1` is not defined earlier in this notebook; assuming it refers to the user review data
review_df.head()

In [None]:
df.describe()

In [None]:
df.head()

In [None]:
# keep only the apps with a perfect 5.0 rating
five_rating = df[df['Rating'] == 5.0]

In [None]:
five_rating.describe()

In [None]:
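# five most frequent categories among the five-star apps
pie_data = five_rating['Category'].value_counts().head()
colors = ['peachpuff', 'silver', 'lightcyan', 'powderblue', 'thistle']
# rename the Series so the pie chart's y-axis label is unobtrusive
pie_data = pie_data.rename('.')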


In [None]:
pie_data.plot(kind='pie',autopct='%1.0f%%',colors=colors,title="Percentage of top 5 five star rated categories")
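The percentages in a pie chart of the top five categories are shares within those five categories only, not shares of all five-star apps. The sketch below shows the difference on made-up category labels; the series name and its contents are hypothetical.

```python
import pandas as pd

# Hypothetical category labels for apps rated 5.0 (made-up data).
five_star_categories = pd.Series(
    ["FAMILY"] * 5 + ["LIFESTYLE"] * 3 + ["MEDICAL"] * 2 + ["TOOLS"] * 2
    + ["BUSINESS"] * 2 + ["GAME"] * 1 + ["SPORTS"] * 1
)

counts = five_star_categories.value_counts()
top5 = counts.head()

# Share within the top 5 only: this is what a pie chart of `top5` displays.
share_within_top5 = top5 / top5.sum() * 100

# Share of all five-star apps, including categories outside the top 5.
share_of_all = top5 / counts.sum() * 100

print(share_within_top5)
print(share_of_all)
```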

In [None]:
# Rating,Installs,Reviews

In [None]:
df['Installs'].unique()

In [None]:
df.info()

In [None]:
def remove_plus(value):
  """Convert an Installs string such as '10,000+' to an int; return 0 if there is no trailing '+'."""
  if value[-1] == '+':
    # drop the trailing '+' and the thousands separators
    return int(value[:-1].replace(",", ""))
  return 0


df['Installs'] = df['Installs'].apply(remove_plus)


In [None]:
df['Installs']
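The same cleaning can also be expressed with vectorized string methods instead of a row-by-row helper. This is a minimal sketch on a made-up series, assuming the raw values look like the ones shown earlier; the `raw_installs` name and the malformed entry are hypothetical.

```python
import pandas as pd

# Hypothetical sample of raw Installs strings, including one malformed entry.
raw_installs = pd.Series(["10,000+", "5,000,000+", "0", "Free"])

# Remove the thousands separators and the trailing '+', then convert.
# errors="coerce" turns anything non-numeric (here "Free") into NaN.
installs = pd.to_numeric(
    raw_installs.str.replace(",", "", regex=False).str.replace("+", "", regex=False),
    errors="coerce",
)

print(installs)
```

Unlike `remove_plus`, which maps every value without a trailing `+` to 0, this variant marks unparseable entries as NaN, so they stay distinguishable from apps that genuinely have zero installs.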

In [None]:
def convert_int(value):
  """Convert a Reviews string such as '967' or '3.0M' to an int."""
  if value[-1] == 'M':
    # parse the number before the 'M' suffix and scale it to millions
    return int(float(value[:-1]) * 1000000)
  return int(value)


df['Reviews'] = df['Reviews'].apply(convert_int)

In [None]:
df['Reviews']
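If other magnitude suffixes were to appear, the conversion could be driven by a lookup table. The sketch below is an assumption-laden illustration: `parse_count`, `SUFFIX_MULTIPLIERS`, the `K` suffix, and the sample values are hypothetical, and only the `M` case is handled by `convert_int` above.

```python
import pandas as pd

# Multipliers for magnitude suffixes; 'K' is added here as an assumption.
SUFFIX_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_count(text: str) -> int:
    """Convert strings such as '967' or '3.0M' to an integer count."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in SUFFIX_MULTIPLIERS:
        # Parse the numeric part as a float so decimals like '10.5' survive.
        return int(float(text[:-1]) * SUFFIX_MULTIPLIERS[suffix])
    return int(text)


# Made-up review counts in the formats discussed above.
raw_reviews = pd.Series(["159", "87510", "3.0M", "10.5M"])
reviews = raw_reviews.apply(parse_count)

print(reviews)
```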

In [None]:
df.info()

In [None]:
corr_df=df[['Installs','Rating','Reviews']].corr()

In [None]:
import seaborn as sns
# heatmap of the pairwise correlations between Installs, Rating and Reviews
sns.heatmap(corr_df, cmap='rocket_r', annot=True)
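`DataFrame.corr()` computes Pearson correlation by default, which measures linear association and can be dominated by a few very large values. As a hedged alternative, the sketch below compares it with rank-based Spearman correlation on made-up numbers; `toy_df` and its values are hypothetical and are not results from this dataset.

```python
import pandas as pd

# Hypothetical numeric columns after cleaning (made-up values).
toy_df = pd.DataFrame({
    "Installs": [100, 1_000, 10_000, 1_000_000, 50_000_000],
    "Rating": [4.8, 4.1, 4.3, 4.0, 4.5],
    "Reviews": [3, 20, 150, 9_000, 800_000],
})

# Pearson measures linear association and is sensitive to extreme values.
pearson = toy_df.corr(method="pearson")

# Spearman works on ranks, so it only asks whether the relationship is monotonic.
spearman = toy_df.corr(method="spearman")

print(pearson)
print(spearman)
```

Either matrix can be passed to `sns.heatmap` in the same way as `corr_df`.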