# Google Play Store Apps

File: https://www.kaggle.com/lava18/google-play-store-apps/downloads/googleplaystore.csv

This file contains all the details of the applications on Google Play. There are 13 features that describe a given app.


App = Application name
Category = Category the app belongs to
Rating = Overall user rating of the app (when scraped)
Reviews = Number of user reviews for the app (when scraped)
Size = Size of the app (when scraped)
Installs = Number of user downloads/installs for the app (when scraped)
Type = Paid or Free
Price = Price of the app (when scraped)
Content Rating = Age group the app is targeted at - Children / Mature 21+ / Adult
Genres = An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
Last Updated = Date when the app was last updated on Play Store (when scraped)
Current Ver = Current version of the app available on Play Store (when scraped)
Android Ver = Minimum required Android version (when scraped)

In [223]:
import pandas as pd
df = pd.read_csv("/Users/MacBookPro/Downloads/googleplaystore.csv")
print(df.describe())
df.head(20)

            Rating
count  9367.000000
mean      4.193338
std       0.537431
min       1.000000
25%       4.000000
50%       4.300000
75%       4.500000
max      19.000000


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19M,"50,000+",Free,0,Everyone,Art & Design,"April 26, 2018",1.1,4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29M,"1,000,000+",Free,0,Everyone,Art & Design,"June 14, 2018",6.1.61.1,4.2 and up
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33M,"1,000,000+",Free,0,Everyone,Art & Design,"September 20, 2017",2.9.2,3.0 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3.1M,"10,000+",Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",2.8,4.0.3 and up


The describe function showed us something strange. We have 13 columns, several of which (Reviews, Size, Installs, Price) we would expect to be numerical, yet describe reports statistics for only one column, Rating. Let's check the data types of all the columns.
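As a small illustration of why only one column shows up: by default, `describe()` summarizes numeric columns only, so anything stored as strings (dtype `object`) is skipped. The sketch below uses a made-up three-row frame named `toy`, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the dataset: only Rating is stored as a number.
toy = pd.DataFrame({
    "Rating": [4.1, 3.9, 4.7],
    "Reviews": ["159", "967", "87510"],  # digits, but stored as strings
    "Installs": ["10,000+", "500,000+", "5,000,000+"],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # also summarizes string columns

print(list(numeric_summary.columns))
print(list(full_summary.columns))
```

Passing `include="all"` is a quick way to see that the other columns exist but are being treated as text.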

In [224]:
#print(df.dtypes)
print(df.shape)
print(df.info()) # includes both count info and data type info

(10841, 13)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
None


Our suspicions are correct. Many of these columns are not the correct data type. Let's fix that.

In [225]:
#df['Reviews'] = df.Reviews.astype(float)

In [226]:
# Found the error that was preventing me from converting the entire column to float: one value in the Reviews column is '3.0M'.

#df.loc[df['Reviews'] =='3.0M']['Reviews'] = 3.0
df.loc[df['Reviews'] == '3.0M', 'Reviews'] = 3.0

In [227]:
df['Reviews'] = df.Reviews.astype(float)
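One way to locate values like this without reading error messages is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into NaN. This is only a sketch on a made-up sample (the `reviews` series is hypothetical), not the cell that was actually run.

```python
import pandas as pd

# Hypothetical sample standing in for df['Reviews'].
reviews = pd.Series(["159", "967", "3.0M", "87510"])

# errors="coerce" replaces anything that cannot be parsed with NaN
parsed = pd.to_numeric(reviews, errors="coerce")

# Entries that were present originally but failed to parse are the offenders
offenders = reviews[parsed.isna() & reviews.notna()]
print(offenders.tolist())
```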

In [228]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews           float64
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

Alright, let's fix the rest of the columns.

In [229]:
#df.Size = df.Size.astype(float)

# The values are all strings with a unit suffix such as 'M' (megabytes), so a direct cast to float fails.

In [230]:
df.Size.head(20)

0      19M
1      14M
2     8.7M
3      25M
4     2.8M
5     5.6M
6      19M
7      29M
8      33M
9     3.1M
10     28M
11     12M
12     20M
13     21M
14     37M
15    2.7M
16    5.5M
17     17M
18     39M
19     31M
Name: Size, dtype: object

## Cleaning ['Size'] Data

I felt it necessary to write a longer post about this particular process. The Size column needed to be converted from strings to floats. After some exploration, we discovered that the data was formatted as 3.0M, 3.0k, and 3,000+, along with the string "Varies with device".

To test our for loop, we first create a copy of df as df2, so that the original dataframe stays untouched and we can start from it again on every test run. Working directly on the original dataframe was an issue at first, because a run that failed partway would clean some of the data but not the rest.

After each error, we would check df2.Size.unique() to see which types of values were not being handled correctly.

Once all values were handled correctly, we changed the ['Size'] column to type float.
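The conversion described above can also be sketched as a small function applied with `Series.map`, which avoids writing into the dataframe row by row. The helper name `parse_size` and the sample values are assumptions made for illustration. Like the loop, it treats `M` as a factor of 1,000,000 and `k` as a factor of 1,000.

```python
import pandas as pd

def parse_size(value):
    """Convert strings such as '19M', '201k' or '1,000+' to a float."""
    if not isinstance(value, str):
        return value  # already numeric or missing
    if value == "Varies with device":
        return None   # no usable size information
    text = value.replace(",", "").replace("+", "")
    if text.endswith("M"):
        return float(text[:-1]) * 1_000_000
    if text.endswith("k"):
        return float(text[:-1]) * 1_000
    return float(text)

# Hypothetical sample covering each format mentioned above
sizes = pd.Series(["19M", "8.7M", "201k", "1,000+", "Varies with device"])
cleaned = sizes.map(parse_size).astype(float)
print(cleaned)
```

Because the function returns a new series instead of mutating the dataframe, a failed run leaves the original data untouched, which is the same protection the df2 copy provides.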

In [231]:
df2 = df.copy()

In [232]:
for i in range(len(df2['Size'])):
    value = df2.loc[i, 'Size']
    if isinstance(value, float):
        continue  # already numeric (or missing)
    elif value == 'Varies with device':
        df2.loc[i, 'Size'] = None
    elif value.endswith('k'):
        df2.loc[i, 'Size'] = float(value.replace("k", "")) * 1000
    elif value.endswith('+'):
        df2.loc[i, 'Size'] = float(value.replace("+", "").replace(",", ""))
    elif value.endswith('M'):
        df2.loc[i, 'Size'] = float(value.replace("M", "")) * 1000000
    else:
        df2.loc[i, 'Size'] = float(value.replace(",", ""))

# for i in range(len(df['Size'])):
#     if type(df.loc[i, 'Size']) == float:
#         pass
#     elif df.loc[i,'Size'].endswith('M'):
#         a = df.Size[i].replace("M", "")
#         df.loc[i, 'Size'] = df.loc[i, 'Size'] = (float(a) * 1000000)
#     elif df.loc[i,'Size'].endswith('k'):
#         a = df.Size[i].replace("k", "")
#         df.loc[i, 'Size'] = df.loc[i, 'Size'] = (float(a) * 1000)
#     elif df.loc[i,'Size'].endswith('+'):
#         a = df.Size[i].replace("+", "")
#         df.loc[i, 'Size'] = df.loc[i, 'Size'] = float(a)     
#     elif df.loc[i,'Size'] == 'Varies with device':
#         df.loc[i, 'Size'] = None
#     else:
#         pass

# def size_func(df):
#     for i in range(len(df['Size']-1)):
#         if type(df.loc[i, 'Size']) == float:
#             pass
#         elif df.loc[i,'Size'].endswith('M'):
#             a = df.Size[i].replace("M", "")
#             df.loc[i, 'Size'] = df.loc[i, 'Size'] = (float(df.loc[i, 'Size']) * 1000000)
#         elif df.loc[i,'Size'].endswith('k'):
#             a = df.Size[i].replace("k", "")
#             df.loc[i, 'Size'] = df.loc[i, 'Size'] = (float(df.loc[i, 'Size']) * 1000)
#         elif df.loc[i,'Size'].endswith('+'):
#             a = df.Size[i].replace("+", "")
#             df.loc[i, 'Size'] = df.loc[i, 'Size'] = float(df.loc[i, 'Size'])     
#         elif df.loc[i,'Size'] == 'Varies with device':
#             df.loc[i, 'Size'] = None
#         else:
#             pass

# df.apply(size_func)        
        
# df.Size.head(20)

In [233]:
# df.loc[0, 'Size'] = (float(df.loc[0, 'Size']) * 1000000)
# print df.loc[0,'Size']
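df2.Size.unique  # without parentheses this displays the bound method (and the Series); call df2.Size.unique() to list the distinct values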

<bound method Series.unique of 0        1.9e+07
1        1.4e+07
2        8.7e+06
3        2.5e+07
4        2.8e+06
5        5.6e+06
6        1.9e+07
7        2.9e+07
8        3.3e+07
9        3.1e+06
10       2.8e+07
11       1.2e+07
12         2e+07
13       2.1e+07
14       3.7e+07
15       2.7e+06
16       5.5e+06
17       1.7e+07
18       3.9e+07
19       3.1e+07
20       1.4e+07
21       1.2e+07
22       4.2e+06
23         7e+06
24       2.3e+07
25         6e+06
26       2.5e+07
27       6.1e+06
28       4.6e+06
29       4.2e+06
          ...   
10811    3.9e+06
10812    1.3e+07
10813    2.7e+06
10814    3.1e+07
10815    4.9e+06
10816    6.8e+06
10817      8e+06
10818    1.5e+06
10819    3.6e+06
10820    8.6e+06
10821    2.5e+06
10822    3.1e+06
10823    2.9e+06
10824    8.2e+07
10825    7.7e+06
10826       None
10827    1.3e+07
10828    1.3e+07
10829    7.4e+06
10830    2.3e+06
10831    9.8e+06
10832     582000
10833     619000
10834    2.6e+06
10835    9.6e+06
10836    5.3e+07


In [234]:
#check datatypes
df.dtypes

App                object
Category           object
Rating            float64
Reviews           float64
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [235]:
#change Size column to float
df['Size'] = df2.Size.astype(float)
df.dtypes

App                object
Category           object
Rating            float64
Reviews           float64
Size              float64
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [236]:
#correcting 'Installs' column
for i in range(len(df2['Installs'])):
    value = df2.loc[i, 'Installs']
    if isinstance(value, float):
        continue  # already numeric (or missing)
    elif value.endswith('+'):
        df2.loc[i, 'Installs'] = float(value.replace("+", "").replace(",", ""))
    elif value == 'Free':
        df2.loc[i, 'Installs'] = None  # stray non-numeric value
    else:
        df2.loc[i, 'Installs'] = float(value.replace(",", ""))


In [237]:
df2.Installs.unique()

array([10000.0, 500000.0, 5000000.0, 50000000.0, 100000.0, 50000.0,
       1000000.0, 10000000.0, 5000.0, 100000000.0, 1000000000.0, 1000.0,
       500000000.0, 50.0, 100.0, 500.0, 10.0, 1.0, 5.0, 0.0, None], dtype=object)
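The same result can be reached without an explicit loop by using the string methods on the column. The sketch below runs on a made-up sample (the `installs` series is hypothetical) and relies on `pd.to_numeric` with `errors="coerce"` to turn the stray 'Free' into NaN.

```python
import pandas as pd

# Hypothetical sample standing in for df['Installs'], including the stray 'Free'.
installs = pd.Series(["10,000+", "500,000+", "0", "Free"])

# Remove thousands separators and the trailing plus sign as literal characters
stripped = (
    installs.str.replace(",", "", regex=False)
            .str.replace("+", "", regex=False)
)

# Anything that still is not a number becomes NaN
installs_clean = pd.to_numeric(stripped, errors="coerce")
print(installs_clean)
```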

In [238]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews           float64
Size              float64
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [239]:
df['Installs'] = df2.Installs.astype(float)
df.dtypes

App                object
Category           object
Rating            float64
Reviews           float64
Size              float64
Installs          float64
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [240]:
df2 = df.copy()

In [241]:
#correcting 'Price' column
for i in range(len(df2['Price'])):
    value = df2.loc[i, 'Price']
    if isinstance(value, float):
        continue  # already numeric (or missing)
    elif value.startswith('$'):
        df2.loc[i, 'Price'] = value.replace("$", "")  # still a string; cast to float later
    elif value == 'Everyone':
        df2.loc[i, 'Price'] = None  # stray non-numeric value

In [242]:
df2.Price.unique()

array(['0', '4.99', '3.99', '6.99', '1.49', '2.99', '7.99', '5.99', '3.49',
       '1.99', '9.99', '7.49', '0.99', '9.00', '5.49', '10.00', '24.99',
       '11.99', '79.99', '16.99', '14.99', '1.00', '29.99', '12.99',
       '2.49', '10.99', '1.50', '19.99', '15.99', '33.99', '74.99',
       '39.99', '3.95', '4.49', '1.70', '8.99', '2.00', '3.88', '25.99',
       '399.99', '17.99', '400.00', '3.02', '1.76', '4.84', '4.77', '1.61',
       '2.50', '1.59', '6.49', '1.29', '5.00', '13.99', '299.99', '379.99',
       '37.99', '18.99', '389.99', '19.90', '8.49', '1.75', '14.00',
       '4.85', '46.99', '109.99', '154.99', '3.08', '2.59', '4.80', '1.96',
       '19.40', '3.90', '4.59', '15.46', '3.04', '4.29', '2.60', '3.28',
       '4.60', '28.99', '2.95', '2.90', '1.97', '200.00', '89.99', '2.56',
       '30.99', '3.61', '394.99', '1.26', None, '1.20', '1.04'], dtype=object)
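The values above are still strings, which is why the column is cast with `astype(float)` afterwards. A loop-free sketch of the same cleanup is shown below on a made-up sample (the `prices` series is hypothetical). One detail worth noting: `$` is the end-of-string anchor in regular expressions, so `regex=False` is passed to make `str.replace` treat it as a literal character.

```python
import pandas as pd

# Hypothetical sample standing in for df['Price'], including the stray 'Everyone'.
prices = pd.Series(["0", "$4.99", "$399.99", "Everyone"])

# regex=False treats '$' as a literal character, not the end-of-string anchor
no_symbol = prices.str.replace("$", "", regex=False)

# Strip the symbol and convert in one pass; 'Everyone' becomes NaN
prices_clean = pd.to_numeric(no_symbol, errors="coerce")
print(prices_clean)
```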

In [243]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews           float64
Size              float64
Installs          float64
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [244]:
df['Price'] = df2.Price.astype(float)
df.dtypes

App                object
Category           object
Rating            float64
Reviews           float64
Size              float64
Installs          float64
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [250]:
df2 = df.copy()

In [252]:
#Most dates are formatted as "January 1, 2018"

from datetime import datetime
import numpy as np

for i in range(len(df2['Last Updated'])):
    try:
        df2.loc[i, 'Last Updated'] = datetime.strptime(df2.loc[i, 'Last Updated'], '%B %d, %Y')
    except (ValueError, TypeError):
        # values that do not match the expected format (or are not strings) become missing
        df2.loc[i, 'Last Updated'] = np.nan
df2['Last Updated'].head()

0    2018-01-07 00:00:00
1    2018-01-15 00:00:00
2    2018-08-01 00:00:00
3    2018-06-08 00:00:00
4    2018-06-20 00:00:00
Name: Last Updated, dtype: object

In [None]:
df2['Last Updated'].unique()

In [253]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews           float64
Size              float64
Installs          float64
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

In [256]:
df['Last Updated'] = df2['Last Updated'].astype(datetime)
df.dtypes

App                object
Category           object
Rating            float64
Reviews           float64
Size              float64
Installs          float64
Type               object
Price             float64
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
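Note that the listing above still reports `Last Updated` as `object`: in this run, `astype(datetime)` did not produce a pandas datetime column. A possible alternative, sketched here on a made-up sample (the `last_updated` series and its malformed entry are hypothetical), is `pd.to_datetime` with an explicit format and `errors="coerce"`, which parses and converts the dtype in one step.

```python
import pandas as pd

# Hypothetical sample standing in for df['Last Updated'], with one malformed entry.
last_updated = pd.Series(["January 7, 2018", "August 1, 2018", "1.0.19"])

# Entries that do not match the format become NaT instead of raising an error
parsed_dates = pd.to_datetime(last_updated, format="%B %d, %Y", errors="coerce")
print(parsed_dates.dtype)
```

With a real datetime dtype, accessors such as `parsed_dates.dt.year` become available for later analysis.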