# Google Play Store Dataset - Data wrangling

Dataset from [Kaggle](https://www.kaggle.com/datasets/lava18/google-play-store-apps?select=googleplaystore.csv).

**When analysing data, I believe it is important to understand the data at hand and look at it thoroughly, to ensure the reliability of the results of any analysis performed. This script walks through the process of getting a dataset, understanding it, and cleaning it for further analysis.**

## Table of contents:
* [Import relevant libraries](#import)
* [Read data](#read)
    * [Google play store](#df1)
    * [User reviews](#df2)
    * [Merged data](#df3)
* [Functions to clean columns](#functions)
* [Cleaning and checking each column](#cleancols)
    * [Category](#category)
    * [Rating](#rating)
    * [Reviews](#reviews)
    * [Size](#size)
    * [Installs](#installs)
    * [Type](#type)
    * [Price](#price)
    * [Content Rating](#content)
    * [Genres](#genres)
    * [Last Updated](#lastupdated)
    * [Current Ver](#currentver)
    * [Android Ver](#androidver)
    * [Clean data export](#export)
        

### Import relevant libraries <a class="anchor" id="import"></a>

In [129]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import re

### Read data <a class="anchor" id="read"></a>

In [130]:
df1 = pd.read_csv('googleplaystore.csv')
df2 = pd.read_csv('googleplaystore_user_reviews.csv')

This is our df1 <a class="anchor" id="df1"></a>

In [131]:
df1.info()
df1.isnull().sum()
df1.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


This is our df2 <a class="anchor" id="df2"></a>

In [132]:
df2.info()
df2.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     64295 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37432 non-null  object 
 3   Sentiment_Polarity      37432 non-null  float64
 4   Sentiment_Subjectivity  37432 non-null  float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB


Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


In [133]:
df2.isnull().sum()

App                           0
Translated_Review         26868
Sentiment                 26863
Sentiment_Polarity        26863
Sentiment_Subjectivity    26863
dtype: int64

Let's check why there are so many null values.

In [134]:
df2.sample(20)

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
9115,Arrow.io,Re fix bug go wall goes hangggggg,Neutral,0.0,0.0
56101,Google News,,,,
41725,Extreme Coupon Finder,Love store policy along deals. I used yet stor...,Positive,0.976562,0.6
31558,Czech Public Transport IDOS,"Once, it occurred to me that the application s...",Neutral,0.0,0.6
40037,English Grammar Complete Handbook,,,,
37798,Duolingo: Learn Languages Free,"It's good, adverts annoy me; I used before. Al...",Positive,0.445455,0.518182
11017,BaBe - Baca Berita,,,,
24267,CarMax – Cars for Sale: Search Used Car Inventory,,,,
59814,Hangouts,,,,
42183,FINAL FANTASY BRAVE EXVIUS,"It's ok first, gets old quick. After gets work...",Positive,0.121707,0.433683


Some rows contain NaN in every review column. These are probably empty reviews, so we can get rid of them.

In [135]:
df2 = df2.dropna()
df2.sample(20)
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37427 entries, 0 to 64230
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     37427 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37427 non-null  object 
 3   Sentiment_Polarity      37427 non-null  float64
 4   Sentiment_Subjectivity  37427 non-null  float64
dtypes: float64(2), object(3)
memory usage: 1.7+ MB
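A plain `dropna()` removes a row if any column is missing. As a minimal sketch on made-up data, assuming the goal is only to discard reviews without text, `dropna(subset=...)` restricts the check to the listed columns. The frame `toy_reviews` is hypothetical and only mimics the shape of df2.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the reviews table
toy_reviews = pd.DataFrame({
    'App': ['A', 'A', 'B'],
    'Translated_Review': ['Great', np.nan, 'Bad'],
    'Sentiment': ['Positive', np.nan, 'Negative'],
})

# Drop only the rows where the review text itself is missing
kept = toy_reviews.dropna(subset=['Translated_Review'])
print(len(toy_reviews), '->', len(kept))
```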


Merge both dfs <a class="anchor" id="df3"></a>

In [136]:
df3 = pd.merge(df1,df2, on = 'App', indicator=True, how = 'outer')

In [137]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 83715 entries, 0 to 83714
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   App                     83715 non-null  object  
 1   Category                82217 non-null  object  
 2   Rating                  80705 non-null  float64 
 3   Reviews                 82217 non-null  object  
 4   Size                    82217 non-null  object  
 5   Installs                82217 non-null  object  
 6   Type                    82216 non-null  object  
 7   Price                   82217 non-null  object  
 8   Content Rating          82216 non-null  object  
 9   Genres                  82217 non-null  object  
 10  Last Updated            82217 non-null  object  
 11  Current Ver             82209 non-null  object  
 12  Android Ver             82214 non-null  object  
 13  Translated_Review       74103 non-null  object  
 14  Sentiment             

To check whether the merge went well, I picked an app at random and compared it across the three dataframes.

In [138]:
df_test1 = df1[df1['App'] == 'Coloring book moana']
df_test1

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2033,Coloring book moana,FAMILY,3.9,974,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up


In [139]:
df_test2 = df2[df2['App'] == 'Coloring book moana']
df_test2.info()
df_test2.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 28538 to 28594
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     44 non-null     object 
 1   Translated_Review       44 non-null     object 
 2   Sentiment               44 non-null     object 
 3   Sentiment_Polarity      44 non-null     float64
 4   Sentiment_Subjectivity  44 non-null     float64
dtypes: float64(2), object(3)
memory usage: 2.1+ KB


Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
28538,Coloring book moana,A kid's excessive ads. The types ads allowed a...,Negative,-0.25,1.0
28539,Coloring book moana,It bad >:(,Negative,-0.725,0.833333
28540,Coloring book moana,like,Neutral,0.0,0.0
28542,Coloring book moana,I love colors inspyering,Positive,0.5,0.6
28543,Coloring book moana,I hate,Negative,-0.8,0.9


In [140]:
df_test3 = df3[df3['App'] == 'Coloring book moana']
df_test3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88 entries, 1 to 88
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   App                     88 non-null     object  
 1   Category                88 non-null     object  
 2   Rating                  88 non-null     float64 
 3   Reviews                 88 non-null     object  
 4   Size                    88 non-null     object  
 5   Installs                88 non-null     object  
 6   Type                    88 non-null     object  
 7   Price                   88 non-null     object  
 8   Content Rating          88 non-null     object  
 9   Genres                  88 non-null     object  
 10  Last Updated            88 non-null     object  
 11  Current Ver             88 non-null     object  
 12  Android Ver             88 non-null     object  
 13  Translated_Review       88 non-null     object  
 14  Sentiment               88 n

It looks like the merge doubled the number of rows: this app appears in 2 rows of df1 and has 44 reviews in df2, which gives 88 rows in df3. I will now check for duplicates.

In [141]:
df_test3.duplicated()

1     False
2     False
3     False
4     False
5     False
      ...  
84     True
85     True
86     True
87     True
88     True
Length: 88, dtype: bool
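The row count grows because a merge pairs every matching row on the left with every matching row on the right. A small sketch with made-up data (the frames `apps` and `reviews` are hypothetical stand-ins for df1 and df2):

```python
import pandas as pd

# Hypothetical miniature: the same app listed under two categories
apps = pd.DataFrame({'App': ['X', 'X'], 'Category': ['ART', 'FAMILY']})
# ... and three reviews for that app
reviews = pd.DataFrame({'App': ['X', 'X', 'X'], 'Review': ['good', 'bad', 'ok']})

merged = pd.merge(apps, reviews, on='App', how='outer')

# Every app row is paired with every review row: 2 * 3 = 6 rows
print(len(merged))
```

None of these six rows is an exact duplicate, because the Category differs between the two app rows. Exact duplicates in the merged data therefore have to come from rows that were already repeated in one of the inputs.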

In [142]:
df3 = df3.drop_duplicates(keep = 'last')
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51135 entries, 0 to 83714
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   App                     51135 non-null  object  
 1   Category                49693 non-null  object  
 2   Rating                  48191 non-null  float64 
 3   Reviews                 49693 non-null  object  
 4   Size                    49693 non-null  object  
 5   Installs                49693 non-null  object  
 6   Type                    49692 non-null  object  
 7   Price                   49693 non-null  object  
 8   Content Rating          49692 non-null  object  
 9   Genres                  49693 non-null  object  
 10  Last Updated            49693 non-null  object  
 11  Current Ver             49685 non-null  object  
 12  Android Ver             49690 non-null  object  
 13  Translated_Review       41856 non-null  object  
 14  Sentiment             

Looks like we got rid of the duplicates now

In [143]:
df3.sample(10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity,_merge
59826,BBC Sport,SPORTS,4.2,18679,Varies with device,"1,000,000+",Free,0,Everyone,Sports,"April 25, 2018",Varies with device,Varies with device,Will open general sport page . To get open spo...,Positive,0.0125,0.625,both
79968,DM for WhatsApp,COMMUNICATION,4.4,25,2.9M,"5,000+",Free,0,Everyone,Communication,"June 3, 2018",0.3.7,4.4 and up,,,,,left_only
51049,Diabetes:M,MEDICAL,4.6,15545,Varies with device,"100,000+",Free,0,Everyone,Medical,"August 1, 2018",6.1.3,Varies with device,This best Medication logging / Diabetic I ever...,Positive,0.45625,0.614286,both
75833,💎 I'm rich,LIFESTYLE,3.8,718,26M,"10,000+",Paid,$399.99,Everyone,Lifestyle,"March 11, 2018",1.0.0,4.4 and up,,,,,left_only
56583,Blur Image Background Editor (Blur Photo Editor),PHOTOGRAPHY,4.4,62421,9.6M,"5,000,000+",Free,0,Everyone,Photography,"July 28, 2018",2.4,4.1 and up,It's simple easy use!,Positive,0.270833,0.595238,both
12405,Edmodo,FAMILY,4.1,200214,18M,"10,000,000+",Free,0,Everyone,Education,"August 6, 2018",9.12.6,4.0.3 and up,I FREAKING LOVE THIS SO MUCH I CANT EVEN STAND...,Positive,0.375,0.4,both
65251,Flashlight & LED Torch,TOOLS,4.3,111507,Varies with device,"10,000,000+",Free,0,Everyone,Tools,"December 29, 2017",Varies with device,Varies with device,Alright,Neutral,0.0,0.0,both
51944,Facebook,SOCIAL,4.1,78128208,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device,This new community standar stupid. They banned...,Negative,-0.221212,0.484848,both
69198,Dropbox,PRODUCTIVITY,4.4,1860844,61M,"500,000,000+",Free,0,Everyone,Productivity,"August 1, 2018",Varies with device,Varies with device,Works I need work,Neutral,0.0,0.0,both
44725,Alto's Adventure,GAME,4.6,515240,63M,"10,000,000+",Free,0,Everyone,Action,"June 5, 2018",1.7.1,4.0 and up,Would done 5 stars across board werent small t...,Positive,0.027778,0.322222,both


The last column makes it clear that some apps only have information in df1 and some only in df2. I have decided to leave it like that for now, because we do not want to lose data from df1: some analyses will only need df1.
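The `_merge` column can also be used to count the rows from each source, or to pull out only the rows that carry store information. A minimal sketch on made-up frames (`left` and `right` are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({'App': ['A', 'B'], 'Rating': [4.1, 3.9]})
right = pd.DataFrame({'App': ['B', 'C'], 'Sentiment': ['Positive', 'Negative']})

merged = pd.merge(left, right, on='App', how='outer', indicator=True)

# How many rows came from the left table only, the right table only, or both
print(merged['_merge'].value_counts())

# Rows that carry store information, i.e. were present in the left table
store_rows = merged[merged['_merge'] != 'right_only']
```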

Some more data cleaning is necessary. Let's look at the data in each column and see what needs to be cleaned. I will define two functions: one to look at string columns, and one to look at number columns. <a class="anchor" id="functions"></a>

In [144]:
def str_info_column(df,col_name):
    col = df[col_name]
    info = {
        'name': col_name,
        'unique values': col.unique(),
        'value counts': col.value_counts()
    }
    return info 

In [145]:
def n_info_column(df,col_name):
    col = df[col_name]
    info = {
        'name': col_name,
        'min': col.min(),
        'max': col.max(),
        'value counts':col.value_counts()
    }
    return info 

Let us go column by column now <a class="anchor" id="cleancols"></a>

Category column <a class="anchor" id="category"></a>

In [146]:
str_info_column(df3,'Category') 

{'name': 'Category',
 'unique values': array(['ART_AND_DESIGN', 'FAMILY', 'AUTO_AND_VEHICLES', 'BEAUTY',
        'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMMUNICATION', 'COMICS',
        'DATING', 'TOOLS', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS',
        'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'MEDICAL',
        'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME',
        'SPORTS', 'VIDEO_PLAYERS', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY',
        'TRAVEL_AND_LOCAL', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING',
        'WEATHER', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION', '1.9', nan],
       dtype=object),
 'value counts': GAME                   10247
 FAMILY                  5398
 TOOLS                   2450
 HEALTH_AND_FITNESS      2150
 PRODUCTIVITY            2015
 TRAVEL_AND_LOCAL        1949
 SPORTS                  1921
 FINANCE                 1894
 PHOTOGRAPHY             1802
 MEDICAL                 1679
 DATING                  1638
 COMMUNICATION           1

There is a strange '1.9', let's investigate that.

In [147]:
df3[df3['Category'] == '1.9']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity,_merge
81857,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,,,,,,left_only


This looks like an error, so we will replace it with NaN.

In [148]:
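# np.nan is an explicit missing value; passing None to replace() behaves differently across pandas versions
df3['Category'] = df3['Category'].replace('1.9', np.nan)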

In [149]:
df3.Category.unique()

array(['ART_AND_DESIGN', 'FAMILY', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMMUNICATION', 'COMICS',
       'DATING', 'TOOLS', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS',
       'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'MEDICAL',
       'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME',
       'SPORTS', 'VIDEO_PLAYERS', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY',
       'TRAVEL_AND_LOCAL', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING',
       'WEATHER', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION', nan],
      dtype=object)

Rating column <a class="anchor" id="rating"></a>

In [150]:
n_info_column(df3,'Rating')

{'name': 'Rating',
 'min': 1.0,
 'max': 19.0,
 'value counts': 4.4     7381
 4.5     7080
 4.3     6670
 4.6     6240
 4.2     5262
 4.1     3600
 4.7     3012
 4.0     2325
 3.9     1746
 3.8     1023
 3.7      751
 4.8      586
 3.5      441
 3.4      359
 3.6      346
 5.0      271
 3.1      185
 4.9      182
 3.0      143
 3.3      138
 3.2      102
 2.7       60
 2.6       54
 2.9       45
 2.8       40
 2.5       20
 2.3       20
 2.4       19
 1.0       16
 2.2       14
 2.0       12
 1.9       12
 1.7        8
 1.8        8
 2.1        8
 1.6        4
 1.4        3
 1.5        3
 1.2        1
 19.0       1
 Name: Rating, dtype: int64}

Is it possible to have 19 as the maximum Rating?

This value looks like an error. It belongs to the same row we saw before with '1.9' as the category. Looking closely, all entries in that row are shifted by one column - something probably went wrong with the Category field during scraping, so every later value landed in the wrong column. I'll get rid of this row. Note that the filter below, `Rating < 5`, is broader than that: it also drops apps rated exactly 5.0 and every row with a missing rating, including the review-only rows from df2.

In [151]:
df3 = df3[df3.Rating < 5]
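If the intention is to remove only impossible ratings, a narrower filter is possible. This is a sketch on made-up data, not what produced the outputs in this notebook; the frame `toy` is hypothetical. Keeping rows with a missing Rating in the real df3 would also keep rows with missing Installs, so the later integer conversion would need adjusting.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'App': ['ok app', 'top app', 'unrated app', 'shifted row'],
    'Rating': [4.1, 5.0, np.nan, 19.0],
})

# Keep missing ratings and everything on the 1-5 scale (both ends included);
# only values outside that range are dropped
valid = toy['Rating'].isna() | toy['Rating'].between(1, 5)
cleaned = toy[valid]
print(cleaned['App'].tolist())
```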

Reviews column <a class="anchor" id="reviews"></a>

In [152]:
n_info_column(df3,'Reviews')

{'name': 'Reviews',
 'min': '1',
 'max': '9992',
 'value counts': 78158306    130
 78128208    130
 13791       126
 1842381     124
 1841061     124
            ... 
 7529865       1
 93726         1
 597068        1
 823109        1
 398307        1
 Name: Reviews, Length: 5992, dtype: int64}

It looks like pandas is reading this column as text rather than as numbers, so the min and max above are compared alphabetically. This could get tricky in some analyses, so let's convert it to float.

In [153]:
df3['Reviews'] = df3['Reviews'].astype(float) 

So when we re-run the min and max, we get numbers now!
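`astype(float)` raises an error if any entry is not a valid number. As an alternative sketch, `pd.to_numeric` with `errors='coerce'` turns such entries into NaN instead. The series `raw` is made up, and '3.0M' stands in for a misplaced value like the one in the shifted row.

```python
import pandas as pd

raw = pd.Series(['159', '967', '3.0M'])  # '3.0M' stands in for a misplaced value

# Entries that cannot be parsed become NaN instead of raising an error
reviews = pd.to_numeric(raw, errors='coerce')
print(reviews.min(), reviews.max())
```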

Size column <a class="anchor" id="size"></a>

In [154]:
n_info_column(df3,'Size')

{'name': 'Size',
 'min': '1.0M',
 'max': 'Varies with device',
 'value counts': Varies with device    14999
 11M                     942
 97M                     937
 14M                     861
 24M                     840
                       ...  
 442k                      1
 842k                      1
 412k                      1
 459k                      1
 619k                      1
 Name: Size, Length: 406, dtype: int64}

It would be useful to transform these strings into numbers so we can do some analyses in the future - for example, does size influence the decision to download the app?

In [155]:
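# Split each entry into its number and its unit suffix ('M' or 'k').
# Entries without a suffix, such as 'Varies with device', become NaN.
size_parts = df3['Size'].str.extract(r'^([\d.]+)([Mk])$')

In [156]:
# Assumes M = 1,000,000 bytes and k = 1,000 bytes
size_factor = size_parts[1].map({'M': 1_000_000, 'k': 1_000})

In [157]:
# Multiplying keeps decimals intact, e.g. '8.7M' becomes 8700000.0
df3['Size'] = size_parts[0].astype(float) * size_factor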

In [158]:
df3['Size'] = df3['Size'].astype(float) 
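Decimal sizes and the `k` suffix are the cases most worth checking in a conversion like this. Below is a small sketch on made-up values, assuming `M` means 1,000,000 bytes and `k` means 1,000 bytes (the dataset does not say whether the units are decimal or binary). `size_to_bytes` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

def size_to_bytes(size: pd.Series) -> pd.Series:
    """Convert strings such as '19M' or '442k' to a number of bytes."""
    # Column 0 holds the number, column 1 the unit; non-matching entries give NaN
    parts = size.str.extract(r'^([\d.]+)([Mk])$')
    factor = parts[1].map({'M': 1_000_000, 'k': 1_000})
    return parts[0].astype(float) * factor

sizes = pd.Series(['19M', '8.7M', '442k', 'Varies with device'])
converted = size_to_bytes(sizes)
print(converted)
```

Appending zeros to the string instead would turn '8.7M' into '8.7000000', which parses as 8.7, so multiplying by a factor is the safer route.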

Installs column <a class="anchor" id="installs"></a>

In [159]:
# regex=False: '+' is a special character in regular expressions, so match it literally
df3['Installs'] = df3['Installs'].str.replace('+', '', regex=False) 
df3['Installs'] = df3['Installs'].str.replace(',', '', regex=False) 

In [160]:
df3['Installs'] = df3['Installs'].astype(int) 
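A quick sketch of the same cleaning on a few made-up values. Note that the cleaned numbers are lower bounds, since '10,000+' means at least 10,000 installs. `int64` is spelled out here only to make the integer width explicit.

```python
import pandas as pd

installs = pd.Series(['10,000+', '500,000+', '1,000,000,000+'])

# regex=False treats '+' and ',' as literal characters
clean = (installs.str.replace('+', '', regex=False)
                 .str.replace(',', '', regex=False)
                 .astype('int64'))
print(clean)
```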

Type column <a class="anchor" id="type"></a>

In [161]:
n_info_column(df3,'Type')

{'name': 'Type',
 'min': 'Free',
 'max': 'Paid',
 'value counts': Free    46975
 Paid      944
 Name: Type, dtype: int64}

Price column <a class="anchor" id="price"></a>

In [162]:
n_info_column(df3,'Price')

{'name': 'Price',
 'min': '$0.99',
 'max': '0',
 'value counts': 0          46975
 $0.99        153
 $3.99        124
 $2.99        108
 $4.99        103
            ...  
 $1.59          1
 $6.49          1
 $1.29          1
 $299.99        1
 $1.20          1
 Name: Price, Length: 71, dtype: int64}

In [163]:
# regex=False: as a regular expression '$' means "end of string", not the dollar sign
df3['Price'] = df3['Price'].str.replace('$', '', regex=False) 

In [164]:
df3['Price'] = df3['Price'].astype(float) 
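The dollar sign deserves care because `$` has a special meaning in regular expressions. A minimal sketch on made-up prices, using a literal match:

```python
import pandas as pd

prices = pd.Series(['0', '$0.99', '$399.99'])

# regex=False strips the literal '$' character
clean = prices.str.replace('$', '', regex=False).astype(float)
print(clean)
```

A useful follow-up check on the real data would be that every app with Type 'Free' has a price of 0.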

Content rating column <a class="anchor" id="content"></a>

In [165]:
n_info_column(df3,'Content Rating')

{'name': 'Content Rating',
 'min': 'Adults only 18+',
 'max': 'Unrated',
 'value counts': Everyone           37288
 Teen                6275
 Mature 17+          2418
 Everyone 10+        1900
 Adults only 18+       37
 Unrated                1
 Name: Content Rating, dtype: int64}

Genres column <a class="anchor" id="genres"></a>

In [166]:
str_info_column(df3,'Genres')

{'name': 'Genres',
 'unique values': array(['Art & Design', 'Art & Design;Pretend Play',
        'Art & Design;Creativity', 'Auto & Vehicles', 'Beauty',
        'Books & Reference', 'Business', 'Communication', 'Comics',
        'Comics;Creativity', 'Dating', 'Tools', 'Education;Education',
        'Education', 'Education;Creativity', 'Education;Music & Video',
        'Education;Action & Adventure', 'Education;Pretend Play',
        'Education;Brain Games', 'Entertainment',
        'Entertainment;Music & Video', 'Entertainment;Brain Games',
        'Entertainment;Creativity', 'Events', 'Finance', 'Food & Drink',
        'Health & Fitness', 'Medical', 'House & Home', 'Libraries & Demo',
        'Lifestyle', 'Lifestyle;Pretend Play',
        'Adventure;Action & Adventure', 'Arcade', 'Casual', 'Card',
        'Card;Brain Games', 'Puzzle;Brain Games', 'Casual;Pretend Play',
        'Action', 'Strategy', 'Puzzle', 'Sports', 'Music', 'Word',
        'Racing', 'Casual;Creativity', 'Casual;Ac

It looks like the genre 'Education' appears twice because one of the instances has a trailing space. There is also 'Educational'. Let's fix that so that there is only one 'Education'.

In [167]:
df3['Genres'] = df3['Genres'].str.replace('Education ', 'Education', regex=True) 
df3['Genres'] = df3['Genres'].str.replace('Educational', 'Education', regex=True) 

We could also split the entries with more than one genre into two columns.

In [168]:
df3[['Genre1', 'Genre2']] = df3['Genres'].str.split(';', expand=True)

In [169]:
df3.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity,_merge,Genre1,Genre2
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,19000000.0,10000,Free,0.0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,,,,,left_only,Art & Design,
23,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14000000.0,500000,Free,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,A kid's excessive ads. The types ads allowed a...,Negative,-0.25,1.0,both,Art & Design,Pretend Play
24,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14000000.0,500000,Free,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,It cute.,Positive,0.5,1.0,both,Art & Design,Pretend Play
25,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14000000.0,500000,Free,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,It bad >:(,Negative,-0.725,0.833333,both,Art & Design,Pretend Play
26,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14000000.0,500000,Free,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,like,Neutral,0.0,0.0,both,Art & Design,Pretend Play
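Assigning the split result to exactly two columns only works if no entry contains more than one `;`. As a hedged sketch on made-up values, passing `n=1` caps the number of splits, so the result always has at most two columns:

```python
import pandas as pd

genres = pd.Series(['Art & Design', 'Art & Design;Pretend Play'])

# n=1 splits at the first ';' only, so there are never more than two parts
split = genres.str.split(';', n=1, expand=True)
split.columns = ['Genre1', 'Genre2']
print(split)
```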


Last Updated column <a class="anchor" id="lastupdated"></a>

In [170]:
n_info_column(df3,'Last Updated')

{'name': 'Last Updated',
 'min': 'April 1, 2016',
 'max': 'September 9, 2016',
 'value counts': July 31, 2018        3331
 August 3, 2018       2464
 August 2, 2018       2199
 August 1, 2018       2046
 August 6, 2018       1383
                      ... 
 November 16, 2015       1
 March 28, 2016          1
 April 17, 2014          1
 April 11, 2016          1
 March 23, 2014          1
 Name: Last Updated, Length: 1288, dtype: int64}

In [171]:
df3['Last Updated'].sample(20)

27360        July 17, 2018
64311       August 3, 2018
46175        June 29, 2018
60060        June 11, 2018
18278       August 2, 2018
23468         May 29, 2018
693         August 3, 2018
61805         June 9, 2018
33225        July 31, 2018
4580        August 2, 2018
31689        July 10, 2018
38457        July 12, 2018
68111       August 6, 2018
30280        April 9, 2018
62558       August 2, 2018
75993    December 21, 2017
54716       August 3, 2018
64338       August 3, 2018
63628        July 10, 2018
49537        July 16, 2018
Name: Last Updated, dtype: object

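The values look consistent. They are still stored as text, though, so the min and max above are alphabetical rather than chronological.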
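If chronological comparisons are needed later, the text dates can be parsed into datetimes. A sketch on a few values taken from the output above, assuming every entry follows the `Month day, year` pattern with English month names:

```python
import pandas as pd

dates = pd.Series(['April 1, 2016', 'September 9, 2016', 'August 3, 2018'])

# %B = full month name, %d = day of month, %Y = four-digit year
parsed = pd.to_datetime(dates, format='%B %d, %Y')

# The text maximum is alphabetical; the parsed maximum is the latest date
print(dates.max(), '|', parsed.max().date())
```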

Current Ver column <a class="anchor" id="currentver"></a>

In [172]:
str_info_column(df3,'Current Ver')

{'name': 'Current Ver',
 'unique values': array(['1.0.0', '2.0.0', '1.2.4', ..., '1.5.447', '1.0.612928', '0.3.4'],
       dtype=object),
 'value counts': Varies with device    12199
 1.0.6                   675
 4.0.0                   674
 1.0                     579
 7.9.3                   535
                       ...  
 17.4.11                   1
 6.7.15.7                  1
 1.8.7.0                   1
 5.1.10                    1
 0.3.4                     1
 Name: Current Ver, Length: 2608, dtype: int64}

In [173]:
df3['Current Ver'].sample(20)

23909                 1.2.0
75030    Varies with device
18647    Varies with device
5448           3.2.0.100171
15781          5.10.1.40699
31971             10.322.16
8107                 1.31.4
70754    Varies with device
17975                 5.3.8
11702                20.7.2
43882                1.1.40
81584                   1.3
56559    Varies with device
53873                  4.89
78371                   4.8
7404                  1.639
43905                1.1.40
1001                  1.1.0
37034                2.1.10
36254                2.21.5
Name: Current Ver, dtype: object

All good with this column!

Android Ver column <a class="anchor" id="androidver"></a>

In [174]:
df3['Android Ver'].value_counts()

Varies with device    11444
4.1 and up            10763
4.0.3 and up           7147
4.4 and up             4437
4.0 and up             3747
5.0 and up             3153
2.3 and up             2129
4.2 and up             1542
3.0 and up              676
2.3.3 and up            655
4.3 and up              563
6.0 and up              416
2.2 and up              382
2.1 and up              304
1.6 and up              152
7.0 and up              110
3.2 and up               65
2.0 and up               58
1.5 and up               53
7.1 and up               42
4.0.3 - 7.1.1            33
5.1 and up               15
3.1 and up                8
2.0.1 and up              6
4.4W and up               5
8.0 and up                4
5.0 - 8.0                 3
1.0 and up                2
7.0 - 7.1.1               1
4.1 - 7.1.1               1
5.0 - 6.0                 1
Name: Android Ver, dtype: int64

In [175]:
df3['Android Ver'].sample(20)

76113          4.0.3 and up
11790            2.3 and up
53324    Varies with device
14823            4.4 and up
69881    Varies with device
74741          4.0.3 and up
13558    Varies with device
12505    Varies with device
5720     Varies with device
64878    Varies with device
33199          4.0.3 and up
55380            4.1 and up
24675            4.4 and up
70528    Varies with device
28606            4.1 and up
53503            4.4 and up
79446            1.6 and up
2495             4.1 and up
75524            4.1 and up
74645          4.0.3 and up
Name: Android Ver, dtype: object

all good with this column!
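The column is consistent as text, but an analysis by minimum Android version would need the number pulled out. One possible sketch, on values taken from the counts above: it captures the leading dotted number and leaves entries such as 'Varies with device' as NaN. The name `min_version` is hypothetical.

```python
import pandas as pd

android = pd.Series(['4.0.3 and up', '4.4W and up', '5.0 - 8.0', 'Varies with device'])

# Capture the leading dotted number; entries without one become NaN
min_version = android.str.extract(r'^(\d+(?:\.\d+)*)', expand=False)
print(min_version)
```

The result is still text ('4.0.3' is not a valid float), so comparing versions would need a further step such as splitting on the dots.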

Clean data export <a class="anchor" id="export"></a>

In [177]:
df3.to_excel("clean-data.xlsx") 
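Writing `.xlsx` files with `to_excel` relies on an Excel engine such as `openpyxl` being installed. As an alternative sketch, a CSV export needs no extra package. The frame `toy` and the temporary output path are illustrative only.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'App': ['A', 'B'], 'Installs': [10000, 500000]})

# Hypothetical output location; a temporary folder keeps the example self-contained
path = os.path.join(tempfile.mkdtemp(), 'clean-data.csv')

# index=False leaves the row labels out of the file
toy.to_csv(path, index=False)

# Read the file back to confirm the round trip
restored = pd.read_csv(path)
print(restored)
```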