# Exploratory Data Analysis on the Google Play Store App Dataset

## About Dataset
**App**: The name of the app\
**Category**: The category of the app (e.g., social media, entertainment, productivity, etc.)\
**Rating**: The average rating of the app (out of 5)\
**Reviews**: The number of reviews for the app\
**Size**: The size of the app, stored as text with an `M` (megabytes) or `k` (kilobytes) suffix, or `Varies with device`\
**Installs**: The number of installs for the app\
**Type**: Whether the app is free or paid\
**Price**: The price of the app (if it's paid)\
**Content Rating**: The content rating of the app (e.g., Everyone, Teen, Mature)\
**Genres**: The genres of the app (e.g., action, adventure, puzzle, etc.)\
**Last Updated**: The date when the app was last updated\
**Current Ver**: The current version of the app\
**Android Ver**: The minimum Android version required to run the app


In [48]:
# Import Libraries 
# Data
import numpy as np
import pandas as pd
from collections import defaultdict

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msn
from wordcloud import WordCloud

# Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# Regression
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor

# Classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Metrics
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error


# Hide warnings
import warnings
warnings.filterwarnings('ignore')

# display maximum columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [49]:
# import dataset
df = pd.read_csv('./data/googleplaystore.csv')
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  int64  
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10841 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10839 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 1.1+ MB


In [51]:
# check null values
df.isnull().sum()

App                  0
Category             1
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       0
Genres               1
Last Updated         0
Current Ver          8
Android Ver          2
dtype: int64
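
Raw counts are easier to judge next to the share of rows they represent. In the output above, `Rating` is missing in 1474 of 10841 rows, roughly 13.6%. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `missing` summary name are illustrative assumptions, not part of the notebook) showing one way to report both numbers together:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Play Store data (values are made up)
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, np.nan, 3.9],
    "Type": ["Free", None, "Paid", "Free"],
})

# Count and percentage of missing values per column, largest first
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("n_missing", ascending=False)

print(missing)
```

On the real data the same pattern would be applied to `df` instead of `toy`.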

### Checking why each column has null values

In [52]:
# show rows with a null value in the Type column
df[df['Type'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
9148,Command & Conquer: Rivals,FAMILY,,0,Varies with device,0,,0,Everyone 10+,Strategy,"June 28, 2018",Varies with device,Varies with device


In [53]:
# unique values in type column
df['Type'].unique()

array(['Free', 'Paid', nan], dtype=object)

In [54]:
# replace the NaN value in the Type column with 'Free'
df['Type'] = df['Type'].fillna('Free')

In [55]:
df['Type'].unique()

array(['Free', 'Paid'], dtype=object)

In [56]:
df[df['Category'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,,1.9,19,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up


In [57]:
df['Category'].unique()

array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION', nan],
      dtype=object)

In [58]:
# Replace the null value in Category with 'PHOTOGRAPHY'
df['Category'] = df['Category'].fillna('PHOTOGRAPHY')

In [59]:
df['Category'].unique()

array(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION',
       'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE',
       'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME',
       'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL',
       'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL',
       'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER',
       'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION'],
      dtype=object)

In [60]:
df[df['Genres'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,PHOTOGRAPHY,1.9,19,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up


In [61]:
df['Genres'].unique()

array(['Art & Design', 'Art & Design;Pretend Play',
       'Art & Design;Creativity', 'Art & Design;Action & Adventure',
       'Auto & Vehicles', 'Beauty', 'Books & Reference', 'Business',
       'Comics', 'Comics;Creativity', 'Communication', 'Dating',
       'Education;Education', 'Education', 'Education;Creativity',
       'Education;Music & Video', 'Education;Action & Adventure',
       'Education;Pretend Play', 'Education;Brain Games', 'Entertainment',
       'Entertainment;Music & Video', 'Entertainment;Brain Games',
       'Entertainment;Creativity', 'Events', 'Finance', 'Food & Drink',
       'Health & Fitness', 'House & Home', 'Libraries & Demo',
       'Lifestyle', 'Lifestyle;Pretend Play',
       'Adventure;Action & Adventure', 'Arcade', 'Casual', 'Card',
       'Casual;Pretend Play', 'Action', 'Strategy', 'Puzzle', 'Sports',
       'Music', 'Word', 'Racing', 'Casual;Creativity',
       'Casual;Action & Adventure', 'Simulation', 'Adventure', 'Board',
       'Trivia', 'Role 

In [62]:
# Replace null values with 'Tools' 
df['Genres'] = df['Genres'].fillna('Tools')

In [63]:
df[df['Current Ver'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
15,Learn To Draw Kawaii Characters,ART_AND_DESIGN,3.2,55,2.7M,"5,000+",Free,0,Everyone,Art & Design,"June 6, 2018",,4.2 and up
1553,Market Update Helper,LIBRARIES_AND_DEMO,4.1,20145,11k,"1,000,000+",Free,0,Everyone,Libraries & Demo,"February 12, 2013",,1.5 and up
6322,Virtual DJ Sound Mixer,TOOLS,4.2,4010,8.7M,"500,000+",Free,0,Everyone,Tools,"May 10, 2017",,4.0 and up
6803,BT Master,FAMILY,,0,222k,100+,Free,0,Everyone,Education,"November 6, 2016",,1.6 and up
7333,Dots puzzle,FAMILY,4.0,179,14M,"50,000+",Paid,$0.99,Everyone,Puzzle,"April 18, 2018",,4.0 and up
7407,Calculate My IQ,FAMILY,,44,7.2M,"10,000+",Free,0,Everyone,Entertainment,"April 3, 2017",,2.3 and up
7730,UFO-CQ,TOOLS,,1,237k,10+,Paid,$0.99,Everyone,Tools,"July 4, 2016",,2.0 and up
10342,La Fe de Jesus,BOOKS_AND_REFERENCE,,8,658k,"1,000+",Free,0,Everyone,Books & Reference,"January 31, 2017",,3.0 and up


In [64]:
# Replace NaN values in the Current Ver column with the mode
df['Current Ver'] = df['Current Ver'].fillna(df['Current Ver'].mode()[0])

In [65]:
df[df['Android Ver'].isnull()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
4453,[substratum] Vacuum: P,PERSONALIZATION,4.4,230,11M,"1,000+",Paid,$1.49,Everyone,Personalization,"July 20, 2018",4.4,
4490,Pi Dark [substratum],PERSONALIZATION,4.5,189,2.1M,"10,000+",Free,0,Everyone,Personalization,"March 27, 2018",1.1,


In [66]:
# Replace NaN values in the Android Ver column with the mode
df['Android Ver'] = df['Android Ver'].fillna(df['Android Ver'].mode()[0])
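
Both fills above rely on `Series.mode()[0]`. `mode()` returns a Series holding every most-frequent value in sorted order, so `[0]` silently picks the first one when there is a tie. A small sketch on invented version strings (the `ver` series is an assumption for illustration only) makes that behaviour visible:

```python
import numpy as np
import pandas as pd

# Made-up version strings with a tie between "1.0" and "2.0"
ver = pd.Series(["1.0", "2.0", "1.0", np.nan, "2.0", "3.1"], name="Current Ver")

modes = ver.mode()        # all most-frequent values, sorted; NaN is ignored
fill_value = modes[0]     # first of the tied modes
filled = ver.fillna(fill_value)

print(list(modes), fill_value)
```

It can be worth printing the mode before filling, since a placeholder such as `Varies with device` may turn out to be the most common value.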

In [67]:
# drop the remaining rows with null values (only Rating still has nulls)
df.dropna(inplace=True)

In [68]:
df.isnull().sum()

App               0
Category          0
Rating            0
Reviews           0
Size              0
Installs          0
Type              0
Price             0
Content Rating    0
Genres            0
Last Updated      0
Current Ver       0
Android Ver       0
dtype: int64
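
A bare `dropna()` removes a row if any column is null. That matches the intent here only because every other column was filled first. As a hedged sketch on a toy frame (made-up data, not the notebook's `df`), the `subset` argument makes the target column explicit:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Rating": [4.5, np.nan, 3.0],
    "Current Ver": ["1.0", "2.0", None],
})

# Default: drop every row that has a null in ANY column
dropped_any = toy.dropna()

# Explicit: drop only rows where Rating is null; other nulls are kept
dropped_rating = toy.dropna(subset=["Rating"])

print(len(dropped_any), len(dropped_rating))
```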

# Feature Encoding

In [69]:
# Check the shape of the dataset
print(f'Dataset has {df.shape[0]} rows and {df.shape[1]} columns')

Dataset has 9367 rows and 13 columns


# Size Column

In [70]:
# unique values in the Size column
df['Size'].unique()

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M',
       '11M', '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M',
       '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', '3.6M',
       '5.7M', '8.6M', '2.4M', '27M', '2.5M', '7.0M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '22M', '6.4M', '3.2M', '8.2M', '4.9M', '9.5M', '5.0M',
       '5.9M', '13M', '73M', '6.8M', '3.5M', '4.0M', '2.3M', '2.1M',
       '42M', '9.1M', '55M', '23k', '7.3M', '6.5M', '1.5M', '7.5M', '51M',
       '41M', '48M', '8.5M', '46M', '8.3M', '4.3M', '4.7M', '3.3M', '40M',
       '7.8M', '8.8M', '6.6M', '5.1M', '61M', '66M', '79k', '8.4M',
       '3.7M', '118k', '44M', '695k', '1.6M', '6.2M', '53M', '1.4M',
      

In [71]:
# drop rows whose Size is the string 'Varies with device'
df = df.drop(df[df['Size'].str.contains('Varies with device', na=False)].index)

In [72]:
# Convert a Size string to bytes: a 'k' suffix is multiplied by 1024 and an 'M' suffix by 1024*1024; the result is a float
def convert_size(size):
    if 'k' in size:
        return float(size.replace('k', '')) * 1024
    elif 'M' in size:
        return float(size.replace('M', '')) * 1024 * 1024
    return size
 

In [73]:
# Apply the convert_size function to the Size column
df['Size'] = df['Size'].apply(convert_size)

In [74]:
# check data type of Size column
df['Size'].dtype

dtype('float64')

In [75]:
# rename the 'Size' column to 'Size(mb)'
df.rename(columns={'Size': 'Size(mb)'}, inplace=True)
# convert bytes to megabytes by dividing by 1024*1024
df['Size(mb)'] = df['Size(mb)'] / (1024 * 1024)
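
The cells above convert every size to bytes and then back to megabytes. A possible shortcut, sketched here on a few sample strings taken from the `unique()` output (the names `size`, `parts`, `factor`, and `size_mb` are illustrative assumptions), goes straight to megabytes with vectorized string methods and turns non-matching values such as `Varies with device` into `NaN` instead of requiring them to be dropped first:

```python
import pandas as pd

size = pd.Series(["19M", "8.7M", "201k", "Varies with device"])

# Split into numeric part and unit suffix; non-matching rows become NaN
parts = size.str.extract(r"^([\d.]+)([kM])$")
number = pd.to_numeric(parts[0])

# 1 M = 1 MB and 1 k = 1/1024 MB, the same factors used in the cells above
factor = parts[1].map({"M": 1.0, "k": 1 / 1024})
size_mb = number * factor

print(size_mb)
```

Whether to keep the `NaN` rows or drop them, as the notebook does, is a separate modelling decision.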


In [76]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size(mb),Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19.0,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25.0,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [77]:
df.describe()

Unnamed: 0,Rating,Reviews,Size(mb)
count,7730.0,7730.0,7730.0
mean,4.173558,294634.4,22.954689
std,0.545141,1863110.0,23.445393
min,1.0,1.0,0.008301
25%,4.0,107.25,5.3
50%,4.3,2323.5,14.0
75%,4.5,38959.0,33.0
max,5.0,44893890.0,100.0


# Installs Column

In [78]:
# check unique values in the Installs column
df['Installs'].unique()

array(['10,000+', '500,000+', '5,000,000+', '50,000,000+', '100,000+',
       '50,000+', '1,000,000+', '10,000,000+', '5,000+', '100,000,000+',
       '1,000+', '500,000,000+', '100+', '500+', '10+', '1,000,000,000+',
       '5+', '50+', '1+'], dtype=object)

In [79]:
# Check Value counts in installs column
df['Installs'].value_counts()

Installs
1,000,000+        1302
100,000+          1037
10,000+            969
10,000,000+        825
1,000+             691
5,000,000+         535
500,000+           491
50,000+            437
5,000+             420
100+               303
100,000,000+       201
500+               197
50,000,000+        147
10+                 67
50+                 56
500,000,000+        30
1,000,000,000+      10
5+                   9
1+                   3
Name: count, dtype: int64

In [80]:
# Remove the + and , characters from the Installs column and convert it to integer
df['Installs'] = df['Installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)
df['Installs'] = df['Installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)
df['Installs'] = df['Installs'].apply(lambda x: int(x))
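
The three `apply` calls above can also be written as one vectorized chain. This is a minimal sketch on sample values from the `unique()` output, assuming every entry is a string made of digits, commas, and a trailing `+` (the `installs` and `cleaned` names are illustrative):

```python
import pandas as pd

installs = pd.Series(["10,000+", "1,000,000,000+", "5+"])

# Strip both '+' and ',' in a single regex pass, then cast to integer
cleaned = (
    installs.str.replace(r"[+,]", "", regex=True)
            .astype("int64")
)

print(cleaned.tolist())
```

Note that the cleaned numbers are lower bounds of install buckets, not exact install counts.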

In [81]:
df['Installs'].value_counts()

Installs
1000000       1302
100000        1037
10000          969
10000000       825
1000           691
5000000        535
500000         491
50000          437
5000           420
100            303
100000000      201
500            197
50000000       147
10              67
50              56
500000000       30
1000000000      10
5                9
1                3
Name: count, dtype: int64

In [82]:
df.describe()

Unnamed: 0,Rating,Reviews,Size(mb),Installs
count,7730.0,7730.0,7730.0,7730.0
mean,4.173558,294634.4,22.954689,8416645.0
std,0.545141,1863110.0,23.445393,50135310.0
min,1.0,1.0,0.008301,1.0
25%,4.0,107.25,5.3,10000.0
50%,4.3,2323.5,14.0,100000.0
75%,4.5,38959.0,33.0,1000000.0
max,5.0,44893890.0,100.0,1000000000.0


# Price Column

In [83]:
# Check unique values in Price Column 
df['Price'].unique()

array(['0', '$4.99', '$6.99', '$7.99', '$3.99', '$5.99', '$2.99', '$1.99',
       '$9.99', '$0.99', '$9.00', '$5.49', '$10.00', '$24.99', '$11.99',
       '$79.99', '$16.99', '$14.99', '$29.99', '$12.99', '$3.49',
       '$10.99', '$7.49', '$1.50', '$19.99', '$15.99', '$33.99', '$39.99',
       '$2.49', '$4.49', '$1.70', '$1.49', '$3.88', '$399.99', '$17.99',
       '$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$1.59',
       '$6.49', '$1.29', '$299.99', '$379.99', '$37.99', '$18.99',
       '$389.99', '$8.49', '$1.75', '$14.00', '$2.00', '$3.08', '$2.59',
       '$19.40', '$15.46', '$8.99', '$3.04', '$13.99', '$4.29', '$3.28',
       '$4.60', '$1.00', '$2.90', '$1.97', '$2.56', '$1.20'], dtype=object)

In [84]:
df['Price'].value_counts()

Price
0          7151
$0.99       106
$2.99       101
$4.99        63
$1.99        53
$3.99        45
$1.49        28
$2.49        17
$9.99        16
$5.99        15
$399.99      11
$14.99       10
$6.99         9
$4.49         8
$7.99         7
$29.99        6
$3.49         6
$19.99        5
$24.99        5
$11.99        4
$12.99        4
$10.00        3
$16.99        3
$8.99         2
$1.00         2
$1.70         2
$17.99        2
$5.49         2
$33.99        2
$9.00         2
$10.99        2
$79.99        2
$39.99        1
$14.00        1
$2.00         1
$3.08         1
$2.59         1
$19.40        1
$15.46        1
$13.99        1
$3.04         1
$8.49         1
$4.29         1
$3.28         1
$4.60         1
$2.90         1
$1.97         1
$2.56         1
$1.75         1
$18.99        1
$389.99       1
$37.99        1
$15.99        1
$1.50         1
$7.49         1
$3.88         1
$400.00       1
$3.02         1
$1.76         1
$4.84         1
$4.77         1
$1.61         1
$1

In [85]:
# remove the dollar sign from the Price column and convert it to float
df['Price']=df['Price'].apply(lambda x: x.replace('$', ''))
df['Price']=df['Price'].apply(lambda x: float(x))
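
The same cleaning can be sketched without `apply`, using `str.lstrip` and `pd.to_numeric`. The sample values come from the `unique()` output above; the `price`, `price_num`, and `is_paid` names are assumptions for illustration:

```python
import pandas as pd

price = pd.Series(["0", "$4.99", "$399.99"])

# Drop the leading '$' (if any) and parse the remainder as a float
price_num = pd.to_numeric(price.str.lstrip("$"))

# A boolean flag that can be cross-checked against the Type column
is_paid = price_num > 0

print(price_num.tolist(), is_paid.tolist())
```

Comparing `is_paid` with `Type == 'Paid'` is one quick consistency check on the two columns.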

In [86]:
df.describe()

Unnamed: 0,Rating,Reviews,Size(mb),Installs,Price
count,7730.0,7730.0,7730.0,7730.0,7730.0
mean,4.173558,294634.4,22.954689,8416645.0,1.127468
std,0.545141,1863110.0,23.445393,50135310.0,17.400176
min,1.0,1.0,0.008301,1.0,0.0
25%,4.0,107.25,5.3,10000.0,0.0
50%,4.3,2323.5,14.0,100000.0,0.0
75%,4.5,38959.0,33.0,1000000.0,0.0
max,5.0,44893890.0,100.0,1000000000.0,400.0


# Last Updated Column

In [87]:
# Convert the Last Updated column to datetime format
df['Last Updated']=pd.to_datetime(df['Last Updated'])

In [88]:
df['Updated_Month']=df['Last Updated'].dt.month
df['Updated_Year']=df['Last Updated'].dt.year
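
`pd.to_datetime` infers the date format when none is given. Passing the format explicitly is one way to make the parse stricter. The sketch below uses three dates copied from the rows shown earlier and assumes an English locale for the `%B` month names (the `last_updated` and `parsed` names are illustrative):

```python
import pandas as pd

last_updated = pd.Series(["January 7, 2018", "August 1, 2018", "February 12, 2013"])

# Explicit format: full month name, day, four-digit year
parsed = pd.to_datetime(last_updated, format="%B %d, %Y")

months = parsed.dt.month
years = parsed.dt.year

print(months.tolist(), years.tolist())
```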

In [89]:
df.drop('Last Updated', axis=1, inplace=True)

In [47]:
df.describe()

Unnamed: 0,Rating,Reviews,Size(mb),Installs,Price,Updated_Month,Updated_Year
count,7730.0,7730.0,7730.0,7730.0,7730.0,7730.0,7730.0
mean,4.173558,294634.4,22.954689,8416645.0,1.127468,6.374515,2017.338034
std,0.545141,1863110.0,23.445393,50135310.0,17.400176,2.616339,1.161982
min,1.0,1.0,0.008301,1.0,0.0,1.0,2010.0
25%,4.0,107.25,5.3,10000.0,0.0,5.0,2017.0
50%,4.3,2323.5,14.0,100000.0,0.0,7.0,2018.0
75%,4.5,38959.0,33.0,1000000.0,0.0,8.0,2018.0
max,5.0,44893890.0,100.0,1000000000.0,400.0,12.0,2018.0


# Duplicate values

In [90]:
# Check duplicate values 
df.duplicated().sum()

306
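
`df.duplicated()` with default arguments flags only rows that are identical in every column, and it does not flag the first copy. Before removing anything it can help to look at all copies side by side, and to decide whether an app that appears twice with different review counts should also count as a duplicate. A hedged sketch on a made-up frame (the `toy` data and the variable names are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({
    "App": ["a", "a", "b", "b"],
    "Reviews": [10, 10, 5, 7],
})

# Fully identical rows, not counting the first occurrence
n_dupes = toy.duplicated().sum()

# Every copy of a duplicated row, for inspection
all_copies = toy[toy.duplicated(keep=False)]

# Remove exact duplicates, keeping the first occurrence
deduped = toy.drop_duplicates()

# Stricter alternative: at most one row per app name
by_app = toy.drop_duplicates(subset="App", keep="first")

print(n_dupes, len(all_copies), len(deduped), len(by_app))
```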

In [None]:
# remove duplicate rows
df = df.drop_duplicates()
