![Cover Photo](image.jpg)

# **1.0 About Author**
- **Project:** Google Play Store Apps - EDA
- **Author:** Faizan Ahmad
- **Code Submission Date:** June 12th, 2024
  
**Author's Contact Info:**
[Email](mailto:ma143faizan@gmail.com),
[Github](https://github.com/fitfaizan),
[Kaggle](https://www.kaggle.com/virtualcrush),
[Linkedin](https://www.linkedin.com/in/fitfaizan)

# **2.0 About Data**
- **Google Play Store Apps** (Web scraped data of 10k Play Store apps for analysing the Android market).
- **Data Age:** Updated 5 years ago.
- **Dataset:** 🔗 [*link*](https://www.kaggle.com/datasets/lava18/google-play-store-apps/data?select=googleplaystore.csv)
- **Context:**
While many public datasets (on Kaggle and the like) provide Apple App Store data, there are not many counterpart datasets available for Google Play Store apps anywhere on the web. On digging deeper, I found out that the iTunes App Store page uses a nicely indexed appendix-like structure that allows for simple and easy web scraping. On the other hand, the Google Play Store uses sophisticated modern-day techniques (like dynamic page load) using jQuery, making scraping more challenging.
- **Content**
Each app (row) has values for category, rating, size, and more.
- **Acknowledgements**
This information is scraped from the Google Play Store. This app information would not be available without it.


## **2.1 Task:**
We intend to conduct an Exploratory Data Analysis (EDA) on the given dataset. The EDA will serve as the basis for the necessary Data Wrangling activities to be carried out for the purposes of data cleaning and normalization. During the coding process, we will document our observations. Ultimately, we will produce a summary and draw conclusions from our findings.

## **2.2 Objectives:**
The primary aim of this project is to conduct a thorough analysis of the dataset to identify significant insights. The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

## **2.3 Kernel Version Used:**
- Python 3.11.5

# **3.0 *Import Libraries***
- We will use the following libraries:
    1. Warnings: To suppress warning messages.
    2. Pandas: Data manipulation and analysis library.
    3. Numpy: Numerical computing library.
    4. Matplotlib: Data visualization library.
    5. Seaborn: Statistical data visualization library.


In [1]:
# to avoid warnings
import warnings
warnings.filterwarnings('ignore')

# importing libraries for data manipulation
import pandas as pd
import numpy as np

# importing libraries for visualization
import seaborn as sns
import matplotlib.pyplot as plt


# **4.0  Data Loading, Exploration & Wrangling**

## **4.1 Load the CSV file with pandas:**

In [2]:
# loading data from csv file
df = pd.read_csv("./google_play_apps/googleplaystore.csv")

## **4.2 Creating the dataframe and understanding the data present in the dataset. (Getting a sneak peek of data):** 
With just a few lines of code, we can view the top and bottom rows of the dataset and get a sense of what we are working with, without having to scroll through the entire file.

In [3]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [4]:
df.tail()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device
10840,iHoroscope - 2018 Daily Horoscope & Astrology,LIFESTYLE,4.5,398307,19M,"10,000,000+",Free,0,Everyone,Lifestyle,"July 25, 2018",Varies with device,Varies with device


> **Note**: Some notebook output is truncated, so we can increase the column and row display limits with the following commands:


These Pandas display settings give us a complete view of the data. With every column shown, no information is hidden when we explore or analyze a dataframe. Keep in mind that removing the row limit means that displaying a whole dataframe prints every row.

In [5]:
pd.set_option('display.max_columns', None) # this is to display all the columns in the dataframe
pd.set_option('display.max_rows', None) # this is to display all the rows in the dataframe
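
If changing the display limits for the whole session is more than we need, a scoped alternative is `pd.option_context`, which restores the previous settings when the block ends. The sketch below is only an illustration: the frame `wide` and its column names are made up and stand in for `df`.

```python
import pandas as pd

# Made-up stand-in frame with many columns; the notebook would use `df`.
wide = pd.DataFrame([range(30)], columns=[f"col_{i}" for i in range(30)])

# Remember the current setting so we can confirm it is restored afterwards.
default_cols = pd.get_option("display.max_columns")

# Inside the block there is no column limit; rows are capped at 50.
with pd.option_context("display.max_columns", None, "display.max_rows", 50):
    inside_cols = pd.get_option("display.max_columns")
    print(wide)

# Outside the block the earlier setting is back in place.
restored_cols = pd.get_option("display.max_columns")
```

This keeps long outputs under control in the rest of the notebook while still allowing a full view where it matters.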

## **4.3 Analyzing & describing the dataset:**

### **4.3.1 View the `.info()` of data:**
This code snippet provides a quick summary of the DataFrame, including the number of non-null entries in each column, the data type of each column, and the memory usage of the DataFrame. Using the `.info()` method is essential for getting a concise overview of the dataset, helping us identify missing values, understand the structure of data, and prepare for further data cleaning and analysis tasks.

In [6]:
# using the info command to check the information of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [7]:
# to get unique values of each column in the dataset
cols = df.columns
for col in cols:
    print(col, ":", df[col].unique())
    print("\n")

App : ['Photo Editor & Candy Camera & Grid & ScrapBook' 'Coloring book moana'
 'U Launcher Lite – FREE Live Cool Themes, Hide Apps' ...
 'Parkinson Exercices FR' 'The SCP Foundation DB fr nn5n'
 'iHoroscope - 2018 Daily Horoscope & Astrology']


Category : ['ART_AND_DESIGN' 'AUTO_AND_VEHICLES' 'BEAUTY' 'BOOKS_AND_REFERENCE'
 'BUSINESS' 'COMICS' 'COMMUNICATION' 'DATING' 'EDUCATION' 'ENTERTAINMENT'
 'EVENTS' 'FINANCE' 'FOOD_AND_DRINK' 'HEALTH_AND_FITNESS' 'HOUSE_AND_HOME'
 'LIBRARIES_AND_DEMO' 'LIFESTYLE' 'GAME' 'FAMILY' 'MEDICAL' 'SOCIAL'
 'SHOPPING' 'PHOTOGRAPHY' 'SPORTS' 'TRAVEL_AND_LOCAL' 'TOOLS'
 'PERSONALIZATION' 'PRODUCTIVITY' 'PARENTING' 'WEATHER' 'VIDEO_PLAYERS'
 'NEWS_AND_MAGAZINES' 'MAPS_AND_NAVIGATION' '1.9']


Rating : [ 4.1  3.9  4.7  4.5  4.3  4.4  3.8  4.2  4.6  3.2  4.   nan  4.8  4.9
  3.6  3.7  3.3  3.4  3.5  3.1  5.   2.6  3.   1.9  2.5  2.8  2.7  1.
  2.9  2.3  2.2  1.7  2.   1.8  2.4  1.6  2.1  1.4  1.5  1.2 19. ]


Reviews : ['159' '967' '87510' ... '603' '1195' '3

### **Observation Set 1:**
- There are a total of 13 columns and 10841 rows in the dataset.
- One row appears to be malformed: `Category` contains the value '1.9' and `Rating` contains 19.0, which suggests that the values of this row are shifted by one column.

**Column Names are:**
*'App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
'Android Ver'*
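
The unique-value listing above shows `'1.9'` among the categories and `19.` among the ratings. A minimal sketch for locating such rows is given below. It assumes that valid categories consist only of upper-case letters and underscores and that ratings lie between 1 and 5. The frame `sample` is illustrative (its app names are made up), and in the notebook the same filters would be applied to `df`.

```python
import pandas as pd

# Illustrative frame: app names are made up, the odd values '1.9' and 19.0
# mirror what the unique-value listing shows.
sample = pd.DataFrame({
    "App": ["App A", "App B", "App C"],
    "Category": ["ART_AND_DESIGN", "1.9", "GAME"],
    "Rating": [4.1, 19.0, None],
})

# Ratings above 5 are suspicious (missing ratings compare as False here).
bad_rating = sample["Rating"] > 5

# Assumption: valid categories are upper-case words joined by underscores.
bad_category = ~sample["Category"].str.fullmatch(r"[A-Z_]+")

# Rows that fail either check deserve a manual look.
suspect_rows = sample[bad_rating | bad_category]
print(suspect_rows)
```

Once the affected row is identified, it can be inspected by hand before deciding whether to repair or drop it.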

## **4.5 Descriptive Statistics:**
We use descriptive statistics to summarize and understand the key features of the dataset.

In [None]:
# df.describe()
# df.describe(include='all')

### **Observation Set 2:**
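1. In the raw dataset only `Rating` is numeric (float64). `Reviews`, `Size`, `Installs`, and `Price` are stored as object columns and must be converted before they can be summarized numerically.
2. Once `Size` is converted to bytes (column `Size_bytes`), we will add one more column to hold the same size in MB.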
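
The `.info()` output lists `Reviews`, `Size`, `Installs`, and `Price` as object columns, so they need converting before `describe()` can summarize them. The sketch below shows one possible conversion. It rests on assumptions: sizes end in `M` or `k`, prices may carry a leading `$`, and the helper `size_to_bytes` and the column name `Size_MB` are hypothetical. The `'201k'` and `'$4.99'` values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative rows; '201k' and '$4.99' are assumed formats, not taken
# from the outputs shown in this notebook.
sample = pd.DataFrame({
    "Reviews": ["159", "967", "87510", "215644"],
    "Size": ["19M", "8.7M", "Varies with device", "201k"],
    "Installs": ["10,000+", "5,000,000+", "1,000+", "100+"],
    "Price": ["0", "0", "$4.99", "0"],
})

def size_to_bytes(size):
    """Hypothetical helper: '19M' or '201k' to bytes, anything else to NaN."""
    if isinstance(size, str):
        if size.endswith("M"):
            return float(size[:-1]) * 1024 * 1024
        if size.endswith("k"):
            return float(size[:-1]) * 1024
    return np.nan

sample["Reviews"] = pd.to_numeric(sample["Reviews"], errors="coerce")
sample["Size_bytes"] = sample["Size"].apply(size_to_bytes)
sample["Size_MB"] = sample["Size_bytes"] / (1024 * 1024)

# Strip thousands separators and the trailing '+' before converting.
sample["Installs"] = pd.to_numeric(
    sample["Installs"].str.replace(r"[,+]", "", regex=True), errors="coerce"
)

# Strip the currency sign before converting.
sample["Price"] = pd.to_numeric(
    sample["Price"].str.replace("$", "", regex=False), errors="coerce"
)

print(sample.dtypes)
```

Using `errors="coerce"` turns any value that still cannot be parsed into `NaN` instead of raising, which makes leftover problem rows easy to find with `isnull()`.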

## **4.6 Missing values in the data:**

In [None]:
# df.isnull().sum().sort_values(ascending=False)

# **5.0 Exploratory Analysis and Visualization**

In [None]:
# plt.rcParams['figure.figsize'] = (15,6)
# sns.heatmap(df.isnull(),yticklabels = False, cbar = False , cmap = 'viridis')
# plt.title("Missing null values")

> **Figure-1:** Visualizes the missing values in the dataframe `df`

#### Get a clearer picture of missing data with this code snippet. It shows the percentage of null values per column, sorted in descending order, making it easy to identify which features have the most missing data.

In [None]:
# #df.isnull().sum()/len(df)*100
# missing_percentage = (df.isnull().sum().sort_values(ascending = False)/len(df))*100
# missing_percentage
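
The count and the percentage of missing values can also be viewed side by side. The following sketch builds such a table on a small made-up frame. The names `sample` and `missing_summary` are illustrative, and in the notebook the same lines would run on `df`.

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps, standing in for `df`.
sample = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, np.nan, 3.9],
    "Type": ["Free", "Free", None, "Paid"],
})

# Missing values per column, as a count and as a percentage of all rows.
missing_count = sample.isnull().sum()
missing_percentage = missing_count / len(sample) * 100

# One table, with the columns that have the most gaps first.
missing_summary = pd.DataFrame({
    "missing_count": missing_count,
    "missing_percentage": missing_percentage,
}).sort_values("missing_count", ascending=False)

print(missing_summary)
```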

### Milestone 1: We have cleaned the dataset of null values 🙂

> Next, find duplicated records and check whether they are genuine duplicates.

### Milestone 2: *No duplicates found* 👥

Although App_name may contain repeated values, the records are unique as a whole, with varying versions, AppIds, and release dates.
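
The difference between a repeated app name and a fully duplicated record can be checked with `DataFrame.duplicated`. The sketch below uses a made-up frame that deliberately contains one fully duplicated row, so both counts are non-zero. It illustrates the check only and does not reproduce the result reported above.

```python
import pandas as pd

# Made-up rows: "App A" repeats with different versions, while the two
# "App B" rows are identical in every column.
sample = pd.DataFrame({
    "App": ["App A", "App A", "App B", "App B"],
    "Current Ver": ["1.0", "1.1", "2.0", "2.0"],
    "Reviews": [10, 12, 5, 5],
})

# Rows that repeat an earlier row in every column.
full_duplicates = sample.duplicated().sum()

# Rows whose app name already appeared earlier.
name_duplicates = sample.duplicated(subset=["App"]).sum()

# All rows sharing a name, kept together for manual comparison.
same_name_rows = sample[sample.duplicated(subset=["App"], keep=False)]

print(full_duplicates, name_duplicates)
print(same_name_rows)
```

A name that repeats while other columns differ is a candidate for review, not automatically a duplicate to drop.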

# **6.0 Questions and Answers:**
> We are going to pose the following questions against the dataset:

1. What are the top 10 Categories that are installed from the Apple Store?
2. What are the top 10 highest-rated Primary_Genre values based on Average_User_Rating?
3. Which Primary_Genre has the highest count of Paid and Free apps?
4. What are the Top 5 Paid Apps with the highest ratings?
5. What are the Top 5 Free Apps with the highest ratings?
6. Apps with highest content rating
7. Years in which max apps were released
8. Size in MBs vs Price of App
9. Top 10 app-producing developers
10. Which Genre attracted what kind of clientele in terms of revenue?
11. YoY (Year on Year) comparison of apps per Content_Rating
12. User Rating vs Price
13. User Rating vs MBytes
14. Year on Year breakdown of top-5 Genre based on App Price
15. iOS Versions vs Count of apps
16. Interdependency of numeric attributes on each other
16. Interdependency of numeric attributes on each other
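
For the `Category` and `Installs` columns shown in the earlier outputs, a question such as the first one could be approached with a group-by. The sketch below is one possible way, on made-up rows. It assumes `Installs` values look like `'10,000+'`, and the column name `Installs_num` is hypothetical.

```python
import pandas as pd

# Made-up rows in the same format as the `Category` and `Installs` columns.
sample = pd.DataFrame({
    "Category": ["GAME", "GAME", "TOOLS", "FAMILY", "TOOLS"],
    "Installs": ["10,000+", "5,000,000+", "1,000+", "100+", "500,000+"],
})

# Strip thousands separators and the trailing '+' so installs can be summed.
sample["Installs_num"] = pd.to_numeric(
    sample["Installs"].str.replace(r"[,+]", "", regex=True), errors="coerce"
)

# Total installs per category, largest first, limited to ten entries.
top_categories = (
    sample.groupby("Category")["Installs_num"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)

print(top_categories)
```

Because `Installs` is a lower bound ("10,000+"), the totals are better read as a ranking than as exact download counts.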

> **Figure-3:** Shows the Top 10 Categories that are installed from the Apple Store
- Answer 1: Gaming apps are the most downloaded apps from the store.

- Answer 10:
> A few interesting points need further analysis and attention, such as what kind of apps fall under the Business & Utilities categories and are bought by or for children.

### Facts:
- ⚠️ Caution: The content rating attribute may not always provide reliable information.
- 🚸 Be aware: An app rated 9+ could contain inappropriate content, such as an e-book with mature themes.

# **7.0 Summary**

The EDA exercise conducted on the Apple App Store dataset has yielded numerous interesting insights. The dataset was found to be relatively clean and consistent throughout the analysis. We posed several questions to the dataset and provided detailed answers and findings as follows:


Q1. What are the top 10 categories with the most downloads from the Apple Store?\
A1: Gaming apps are the most downloaded apps from the store.

Q2. What are the top 10 primary genres with the highest average user rating?\
A2: The highest rated primary genres based on average user rating are Weather, Games, Photo & Videos, Music, Books, and References, among others.

Q3. Which primary genre has the highest count of paid and free apps?\
A3: The list of free apps includes Games, Business, Education, Utilities, and Lifestyle, while the list of paid apps includes Education, Games, Utilities, Stickers, and Productivity.

Q4. What are the top 5 paid apps with the highest ratings?\
A4: The top 5 paid apps with the highest ratings are Super Nano Trucks, FarRock Dodgeball, Money Easy - Expense Tracker, Money Flow - Expense Tracker, and Sketch Ideas.

Q5. What are the top 5 free apps with the highest ratings?\
A5: The top 5 free apps with the highest ratings are Rise of Zombie - City Defense, Dog Wheelchairs, Dog App - Breed Scanner, Dojo Login 2, and Dojo Hero.

Q6. What are the apps with the highest content rating?\
A6: The apps with the highest content rating are for children, adults, teens, and everyone, with a breakdown based on the count of apps under each category.

Q7. In which years were the most apps released?\
A7: The year 2020 saw the highest number of app releases, likely due to the COVID-19 pandemic and more people staying at home.

Q8. How does the size of an app in MBs compare to its price?\
A8: We tried to find a correlation between app size and price but found that, except for a few exceptions, the size of the app is irrelevant to the price.

Q9. Who are the top 10 app-producing developers?\
A9: The top 10 app-producing developers are ChowNow, Touch2Success, Alexander Velimirovic, MINDBODY, Incorporated, Phorest, OFFLINE MAP TRIP GUIDE LTD, Magzter Inc., ASK Video, RAPID ACCELERATION INDIA PRIVATE LIMITED, and Nonlinear Educating Inc.

Q10. What types of genres attract which types of clients in terms of revenue?\
A10: Some interesting points need further analysis and attention, such as what kind of Business & Utilities apps are bought by or for children. It was found that the information stored in the content rating attribute is not very reliable and that there is a possibility that an app is rated for 9+ but is actually an e-book with unsuitable topics.

Q11. How do app releases per content rating compare year on year?\
A11: Clearly, the children's category is taking the lead, but as pointed out earlier, the content rating of non-kids apps is also rated under the children's category.

Q12. How does user rating compare to price?\
A12: The higher the user rating, the higher the price of the app.

Q13. How does user rating compare to MB size?\
A13: The higher the user rating, the larger the size in MB of the app.

Q14. How do YoY breakdowns per genre based on app price compare?\
A14: Educational apps contribute more revenue in terms of app sales.

Q16. Interdependency of numeric attributes on each other\
A16: Few attributes show strong dependence on each other, except for Average_User_Rating and Current_Version_Score.
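
Interdependency between numeric attributes is usually inspected through a correlation matrix. The sketch below shows the idea on the five rows from the `df.head()` output, with `Size` already expressed in MB under the hypothetical column name `Size_MB`. It illustrates the technique only and says nothing about the full dataset.

```python
import pandas as pd

# Values taken from the five rows shown by df.head(); `Size_MB` is a
# hypothetical column holding the size as a number of megabytes.
sample = pd.DataFrame({
    "App": ["Photo Editor", "Coloring book", "U Launcher", "Sketch", "Pixel Draw"],
    "Rating": [4.1, 3.9, 4.7, 4.5, 4.3],
    "Reviews": [159, 967, 87510, 215644, 967],
    "Size_MB": [19.0, 14.0, 8.7, 25.0, 2.8],
})

# Pairwise Pearson correlation between the numeric columns only.
corr_matrix = sample.corr(numeric_only=True)

print(corr_matrix)
```

In the notebook, the resulting matrix could be passed to `sns.heatmap(corr_matrix, annot=True)` to draw it as a heatmap.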

---

# **8.0 Conclusion & Findings**

#### The primary goal of this project is to analyze the Apple App Store dataset and identify insights based on the data. By doing so, we aim to project customer dynamics and demands to developers and relevant stakeholders, helping them generate more business for their upcoming applications.


> During this EDA exercise, we have achieved several milestones:

- We have cleaned the dataset from null values.
- No duplications have been found. Although the app names have duplications, they are unique records with different versions, AppIds, and release dates.

> Our findings include:

- Gaming apps are the most downloaded apps from the store.
- The highest rated primary genres based on average user ratings include Weather, Games, Photo & Videos, Music, and Books.

It is important to note that the information stored in the content rating attribute may not always be reliable. There is a possibility that an app rated 9+ contains unsuitable content, such as an e-book with mature themes.