## EDA And Feature Engineering Of Google Play Store Dataset

1) Problem statement.
Today, 1.85 million different apps are available for users to download. Android users have even more to choose from, with 2.56 million available through the Google Play Store. These apps have come to play a huge role in the way we live our lives. Our objective is to find the most popular category, the app with the largest number of installs, the app with the largest size, and so on.
2) Data Collection.

The data consists of 13 columns and 10841 rows.

### Steps We Are Going to Follow
1. Data Cleaning
2. Exploratory Data Analysis
3. Feature Engineering

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import re
import numpy as np

warnings.filterwarnings("ignore")

%matplotlib inline

# Data Cleaning

In [2]:
df=pd.read_csv('https://raw.githubusercontent.com/krishnaik06/playstore-Dataset/main/googleplaystore.csv')
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [4]:
# missing values
df.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64

In [5]:
df['Reviews'].unique()

array(['159', '967', '87510', ..., '603', '1195', '398307'],
      shape=(6002,), dtype=object)

In [6]:
df['Reviews'].str.isnumeric().sum()

np.int64(10840)

In [7]:
df.shape

(10841, 13)

##### So one record is not numeric

In [8]:
df[~df['Reviews'].str.isnumeric()]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


##### Only one record is affected, and its Reviews value is '3.0M' (3 million), so we can set it directly

In [9]:
df.loc[~df['Reviews'].str.isnumeric(), 'Reviews'] = '3000000'

In [10]:
df['Reviews'].str.isnumeric().sum()

np.int64(10841)

##### so now all records can be converted to numeric value

In [11]:
df['Reviews']=df['Reviews'].astype(int)
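
As an alternative way to locate values that will not convert, the following minimal sketch uses `pd.to_numeric` with `errors='coerce'`. The `reviews` Series is toy data invented for illustration, not the real column; `'3.0M'` mimics the one malformed entry found above.

```python
import pandas as pd

# Toy values only (an assumption for illustration); '3.0M' mimics the malformed entry.
reviews = pd.Series(['159', '967', '3.0M'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
parsed = pd.to_numeric(reviews, errors='coerce')

# Rows that failed to parse can be inspected before deciding how to fix them
bad_rows = reviews[parsed.isna()]
print(bad_rows)
```

Unlike `str.isnumeric()`, this approach would also accept decimal strings such as `'4.5'`, so which check fits depends on what the column is expected to hold.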

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  int64  
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), int64(1), object(11)
memory usage: 1.1+ MB


In [13]:
df['Size'].unique()

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '2.7M', '5.5M', '17M', '39M',
       '31M', '4.2M', '7.0M', '23M', '6.0M', '6.1M', '4.6M', '9.2M',
       '5.2M', '11M', '24M', 'Varies with device', '9.4M', '15M', '10M',
       '1.2M', '26M', '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k',
       '3.6M', '5.7M', '8.6M', '2.4M', '27M', '2.5M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '3.7M', '22M', '7.4M', '6.4M', '3.2M', '8.2M', '9.9M',
       '4.9M', '9.5M', '5.0M', '5.9M', '13M', '73M', '6.8M', '3.5M',
       '4.0M', '2.3M', '7.2M', '2.1M', '42M', '7.3M', '9.1M', '55M',
       '23k', '6.5M', '1.5M', '7.5M', '51M', '41M', '48M', '8.5M', '46M',
       '8.3M', '4.3M', '4.7M', '3.3M', '40M', '7.8M', '8.8M', '6.6M',
       '5.1M', '61M', '66M', '79k', '8.4M', '118k', '44M', '695k', '1.6M',
     

In [14]:
def convert_size_to_kb_float(size_str):
    """
    Converts a size string to a float representing the size in kilobytes (KB).
    Handles 'M', 'k', 'number+' and 'Varies with device' values.
    """
    if isinstance(size_str, float):  # already converted (this also passes NaN through)
        return size_str
    if pd.isna(size_str) or size_str == 'Varies with device':
        return np.nan
    
    # Use regular expression to find the number and an optional unit (M, k, or +)
    match = re.match(r'(\d+\.?\d*)(\s*([Mk+]?))', size_str, re.I)
    if match:
        value = float(match.group(1))
        unit = match.group(3).lower() if match.group(3) else ''
        
        if unit == 'm':
            return value * 1024.0  # Convert MB to KB
        elif unit == 'k':
            return value
        elif unit == '+':
            # Assuming 'number+' means bytes, and 1 KB = 1024 bytes
            return value / 1024.0
        elif unit == '':
            # Handle cases with no unit, assuming they are bytes
            return value / 1024.0
            
    return np.nan
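
The same M/k conversion can also be sketched in a vectorized form with `str.extract`. This is a simplified variant under stated assumptions: the `sizes` Series is toy data, and only the `M` and `k` suffixes are handled, so any other value (including 'Varies with device') becomes NaN.

```python
import numpy as np
import pandas as pd

# Toy values only (an assumption for illustration), in the same style as the Size column.
sizes = pd.Series(['19M', '8.7M', '201k', 'Varies with device'])

# Split each string into a numeric part and a unit; non-matching rows become NaN
parts = sizes.str.extract(r'^(?P<value>\d+\.?\d*)(?P<unit>[Mk])$')

# Multiplier that expresses every unit in kilobytes
factor = parts['unit'].map({'M': 1024.0, 'k': 1.0})

size_kb = pd.to_numeric(parts['value']) * factor
print(size_kb)
```

On a column of this size the speed difference from `apply` is small; the main benefit is that the unit handling is visible in one dictionary.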

In [15]:
df['Size']=df['Size'].apply(convert_size_to_kb_float)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  int64  
 4   Size            9146 non-null   float64
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(2), int64(1), object(10)
memory usage: 1.1+ MB


In [16]:
# '+' is a regex metacharacter, so treat it as a literal string
df['Installs']=df['Installs'].str.replace('+','',regex=False)
df['Installs'].value_counts()

Installs
1,000,000        1579
10,000,000       1252
100,000          1169
10,000           1054
1,000             907
5,000,000         752
100               719
500,000           539
50,000            479
5,000             477
100,000,000       409
10                386
500               330
50,000,000        289
50                205
5                  82
500,000,000        72
1                  67
1,000,000,000      58
0                  15
Free                1
Name: count, dtype: int64

In [17]:
df.loc[df['Installs']=='Free']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3000000,0.000977,Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


##### This record contains invalid values in many fields, so we will drop it

In [18]:
df.drop(index=10472,inplace=True)
df.loc[df['Installs']=='Free']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver


In [19]:
df['Installs']=df['Installs'].str.replace(',','')
df['Installs']=df['Installs'].astype(int)
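
The two replacements on `Installs` can be combined, and stray values such as `'Free'` can be flagged before casting. The sketch below uses a toy `installs_raw` Series (invented for illustration) rather than the real column.

```python
import pandas as pd

# Toy values only (an assumption for illustration); 'Free' mimics the misplaced value.
installs_raw = pd.Series(['10,000+', '5,000,000+', 'Free'])

# A character class removes both '+' and ',' in one pass
cleaned = installs_raw.str.replace(r'[+,]', '', regex=True)

# Anything that is not purely digits would make astype(int) fail
bad = ~cleaned.str.isdigit()

installs = cleaned[~bad].astype(int)
print(installs)
```

Checking `bad` first makes it clear which rows need attention, instead of discovering them through a `ValueError` from `astype(int)`.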

In [20]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10840 non-null  object 
 1   Category        10840 non-null  object 
 2   Rating          9366 non-null   float64
 3   Reviews         10840 non-null  int64  
 4   Size            9145 non-null   float64
 5   Installs        10840 non-null  int64  
 6   Type            10839 non-null  object 
 7   Price           10840 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10840 non-null  object 
 10  Last Updated    10840 non-null  object 
 11  Current Ver     10832 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(2), int64(2), object(9)
memory usage: 1.2+ MB


In [21]:
df.tail()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
10836,Sya9a Maroc - FR,FAMILY,4.5,38,54272.0,5000,Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3686.4,100,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9728.0,1000,Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,,1000,Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device
10840,iHoroscope - 2018 Daily Horoscope & Astrology,LIFESTYLE,4.5,398307,19456.0,10000000,Free,0,Everyone,Lifestyle,"July 25, 2018",Varies with device,Varies with device


In [22]:
df['Price'].value_counts()

Price
0          10040
$0.99        148
$2.99        129
$1.99         73
$4.99         72
           ...  
$3.61          1
$394.99        1
$1.26          1
$1.20          1
$1.04          1
Name: count, Length: 92, dtype: int64

In [None]:
# '$' is a regex anchor, so treat it as a literal string
df['Price']=df['Price'].str.replace('$','',regex=False)
df['Price']=df['Price'].astype(float)
df=df.rename(columns={'Price': 'Price in Dollars($)'})
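
Because `$` means "end of string" in a regular expression, it may be worth being explicit about literal matching when stripping it. A minimal sketch on a toy `prices` Series (invented for illustration):

```python
import pandas as pd

# Toy values only (an assumption for illustration), in the same style as the Price column.
prices = pd.Series(['0', '$0.99', '$394.99'])

# regex=False makes '$' a plain character regardless of the pandas version's default
price_usd = prices.str.replace('$', '', regex=False).astype(float)
print(price_usd)
```

With `regex=True`, the pattern `'$'` would match the empty position at the end of each string and leave the dollar sign in place, which is why the explicit flag is the safer choice.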

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   App                  10840 non-null  object 
 1   Category             10840 non-null  object 
 2   Rating               9366 non-null   float64
 3   Reviews              10840 non-null  int64  
 4   Size                 9145 non-null   float64
 5   Installs             10840 non-null  int64  
 6   Type                 10839 non-null  object 
 7   Price in Dollars($)  10840 non-null  float64
 8   Content Rating       10840 non-null  object 
 9   Genres               10840 non-null  object 
 10  Last Updated         10840 non-null  object 
 11  Current Ver          10832 non-null  object 
 12  Android Ver          10838 non-null  object 
dtypes: float64(3), int64(2), object(8)
memory usage: 1.2+ MB


In [27]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price in Dollars($),Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19456.0,10000,Free,0.0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14336.0,500000,Free,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8908.8,5000000,Free,0.0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25600.0,50000000,Free,0.0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2867.2,100000,Free,0.0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [28]:
df['Type'].value_counts()

Type
Free    10039
Paid      800
Name: count, dtype: int64

In [29]:
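# Note: anything that is not 'Free' maps to 1, including the one missing Type value,
# which is why the count for 1 below is 801 rather than 800
df['Type']=df['Type'].map(lambda x:0 if x=='Free' else 1)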
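
The difference between a lambda with an `else` branch and a dictionary mapping shows up on missing values. The sketch below uses a toy `app_type` Series (invented for illustration) to compare the two; it is not a claim about which behaviour the analysis should prefer.

```python
import numpy as np
import pandas as pd

# Toy values only (an assumption for illustration); NaN mimics the one missing Type.
app_type = pd.Series(['Free', 'Paid', np.nan])

# The lambda sends every non-'Free' value, NaN included, to 1
with_lambda = app_type.map(lambda x: 0 if x == 'Free' else 1)

# A dict maps only the listed keys, so NaN stays NaN
with_dict = app_type.map({'Free': 0, 'Paid': 1})

print(with_lambda.tolist())
print(with_dict.tolist())
```

If the missing value should remain visible for later imputation, the dictionary form keeps it as NaN, at the cost of the column becoming float.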

In [30]:
df['Type'].value_counts()

Type
0    10039
1      801
Name: count, dtype: int64

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   App                  10840 non-null  object 
 1   Category             10840 non-null  object 
 2   Rating               9366 non-null   float64
 3   Reviews              10840 non-null  int64  
 4   Size                 9145 non-null   float64
 5   Installs             10840 non-null  int64  
 6   Type                 10840 non-null  int64  
 7   Price in Dollars($)  10840 non-null  float64
 8   Content Rating       10840 non-null  object 
 9   Genres               10840 non-null  object 
 10  Last Updated         10840 non-null  object 
 11  Current Ver          10832 non-null  object 
 12  Android Ver          10838 non-null  object 
dtypes: float64(3), int64(3), object(7)
memory usage: 1.2+ MB


In [32]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price in Dollars($),Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19456.0,10000,0,0.0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14336.0,500000,0,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8908.8,5000000,0,0.0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25600.0,50000000,0,0.0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2867.2,100000,0,0.0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [33]:
df['Last Updated']=df['Last Updated'].str.replace(' ',',')

In [34]:
df['Last Updated']

0         January,7,,2018
1        January,15,,2018
2          August,1,,2018
3            June,8,,2018
4           June,20,,2018
               ...       
10836       July,25,,2017
10837        July,6,,2018
10838    January,20,,2017
10839    January,19,,2015
10840       July,25,,2018
Name: Last Updated, Length: 10840, dtype: object

In [49]:
df['Last Updated Month']=df['Last Updated'].str.split(',').str[0]
df['Last Updated Day']=df['Last Updated'].str.split(',').str[1]
df['Last Updated Year']=df['Last Updated'].str.split(',').str[3]

In [50]:
df.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price in Dollars($),Content Rating,Genres,Last Updated,Current Ver,Android Ver,Last Updated Month,Last Updated Day,Last Updated Year
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19456.0,10000,0,0.0,Everyone,Art & Design,"January,7,,2018",1.0.0,4.0.3 and up,January,7,2018
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14336.0,500000,0,0.0,Everyone,Art & Design;Pretend Play,"January,15,,2018",2.0.0,4.0.3 and up,January,15,2018
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8908.8,5000000,0,0.0,Everyone,Art & Design,"August,1,,2018",1.2.4,4.0.3 and up,August,1,2018
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25600.0,50000000,0,0.0,Teen,Art & Design,"June,8,,2018",Varies with device,4.2 and up,June,8,2018
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2867.2,100000,0,0.0,Everyone,Art & Design;Creativity,"June,20,,2018",1.1,4.4 and up,June,20,2018


In [51]:
df.drop('Last Updated',axis=1,inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   App                  10840 non-null  object 
 1   Category             10840 non-null  object 
 2   Rating               9366 non-null   float64
 3   Reviews              10840 non-null  int64  
 4   Size                 9145 non-null   float64
 5   Installs             10840 non-null  int64  
 6   Type                 10840 non-null  int64  
 7   Price in Dollars($)  10840 non-null  float64
 8   Content Rating       10840 non-null  object 
 9   Genres               10840 non-null  object 
 10  Current Ver          10832 non-null  object 
 11  Android Ver          10838 non-null  object 
 12  Last Updated Month   10840 non-null  object 
 13  Last Updated Day     10840 non-null  object 
 14  Last Updated Year    10840 non-null  object 
dtypes: float64(3), int64(3), object(9)
memory

In [52]:
months_to_numbers = {
    'January': 1,
    'February': 2,
    'March': 3,
    'April': 4,
    'May': 5,
    'June': 6,
    'July': 7,
    'August': 8,
    'September': 9,
    'October': 10,
    'November': 11,
    'December': 12
}

In [53]:
df['Last Updated Month']=df['Last Updated Month'].map(months_to_numbers)
df['Last Updated Day']=df['Last Updated Day'].astype(int)
df['Last Updated Year']=df['Last Updated Year'].astype(int)
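
The month, day, and year columns could also be derived by parsing the original strings as dates. This is a sketch of that alternative using `pd.to_datetime`, on a toy `last_updated` Series (invented for illustration) in the original "Month D, YYYY" format.

```python
import pandas as pd

# Toy values only (an assumption for illustration), in the original 'Last Updated' format.
last_updated = pd.Series(['January 7, 2018', 'August 1, 2018', 'July 25, 2017'])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(last_updated, format='%B %d, %Y')

# The .dt accessor exposes the components as integers
parts = pd.DataFrame({
    'Last Updated Month': parsed.dt.month,
    'Last Updated Day': parsed.dt.day,
    'Last Updated Year': parsed.dt.year,
})
print(parts)
```

This avoids the manual month-name dictionary and the string splitting, and a malformed date raises an error at parse time instead of silently producing a wrong part.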

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10840 entries, 0 to 10840
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   App                  10840 non-null  object 
 1   Category             10840 non-null  object 
 2   Rating               9366 non-null   float64
 3   Reviews              10840 non-null  int64  
 4   Size                 9145 non-null   float64
 5   Installs             10840 non-null  int64  
 6   Type                 10840 non-null  int64  
 7   Price in Dollars($)  10840 non-null  float64
 8   Content Rating       10840 non-null  object 
 9   Genres               10840 non-null  object 
 10  Current Ver          10832 non-null  object 
 11  Android Ver          10838 non-null  object 
 12  Last Updated Month   10840 non-null  int64  
 13  Last Updated Day     10840 non-null  int64  
 14  Last Updated Year    10840 non-null  int64  
dtypes: float64(3), int64(6), object(6)
memory

In [56]:
df.to_csv('data/playstor.csv')
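
Saving works only if the target directory already exists, and by default the index is written as an extra unnamed column. The sketch below illustrates both points on a toy frame; the temporary directory and the file name `playstore_clean.csv` are hypothetical and chosen only for this example.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame only (an assumption for illustration), standing in for the cleaned data.
toy = pd.DataFrame({'App': ['A', 'B'], 'Installs': [10, 500]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'data'
    # Create the folder first; to_csv does not create missing directories
    out_dir.mkdir(parents=True, exist_ok=True)

    out_path = out_dir / 'playstore_clean.csv'  # hypothetical file name
    # index=False keeps the row index out of the file
    toy.to_csv(out_path, index=False)

    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
```

When the index is written, reading the file back produces an `Unnamed: 0` column, which `index=False` avoids.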