# Google Play Store Dataset - Data wrangling

Dataset from [Kaggle](https://www.kaggle.com/datasets/lava18/google-play-store-apps?select=googleplaystore.csv).

**When analysing data, I believe it is important to understand the data at hand and examine it thoroughly, to ensure the reliability of the results of any analysis performed. This script walks through the process of getting a dataset, understanding it, and cleaning it for further analysis.**

Import relevant libraries

In [279]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import re

Read data

In [193]:
df1 = pd.read_csv('googleplaystore.csv')
df2 = pd.read_csv('googleplaystore_user_reviews.csv')

In [194]:
df1.info()
df1.isnull().sum()
df1.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [195]:
df2.info()
df2.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     64295 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37432 non-null  object 
 3   Sentiment_Polarity      37432 non-null  float64
 4   Sentiment_Subjectivity  37432 non-null  float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB


Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0,0.533333
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25,0.288462
2,10 Best Foods for You,,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4,0.875
4,10 Best Foods for You,Best idea us,Positive,1.0,0.3


In [196]:
df2.isnull().sum()

App                           0
Translated_Review         26868
Sentiment                 26863
Sentiment_Polarity        26863
Sentiment_Subjectivity    26863
dtype: int64

Check why there are so many null values

In [197]:
df2.sample(20)

Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
21604,Calorie Counter & Diet Tracker,Can't say enough this. This another weight iss...,Positive,0.233333,0.633333
22606,Camera FV-5 Lite,,,,
14204,Best New Ringtones 2018 Free 🔥 For Android™,,,,
4764,Agoda – Hotel Booking Deals,"Amazing app! Used book hotels, way price, Thai...",Positive,0.75,0.9
56490,Google Photos,This great photo gallery app. You back photos ...,Positive,0.4,0.375
13803,BeautyPlus - Easy Photo Editor & Selfie Camera,,,,
62716,Home Security Camera WardenCam - reuse old phones,Pretty intuitive controls. Great functionality...,Positive,0.363889,0.630556
8590,Apartments.com Rental Search,,,,
59237,HTC Sense Input,"I need bulgarian phonetic keyboard, traditiona...",Positive,0.1375,0.375
59625,Hairstyles step by step,Cute adorable also amazing,Positive,0.533333,0.966667


There are some rows with NaN, so we can get rid of those as they are possibly empty reviews

In [198]:
df2 = df2.dropna()
df2.sample(20)
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37427 entries, 0 to 64230
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     37427 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37427 non-null  object 
 3   Sentiment_Polarity      37427 non-null  float64
 4   Sentiment_Subjectivity  37427 non-null  float64
dtypes: float64(2), object(3)
memory usage: 1.7+ MB
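The null counts above differ slightly between columns (26868 for `Translated_Review`, 26863 for the others), so a few rows carry sentiment values without review text. Below is a minimal sketch of how `dropna` behaves with and without `subset`. The toy frame `reviews` and its values are invented for illustration and are not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df2 (hypothetical values)
reviews = pd.DataFrame({
    "App": ["A", "A", "B", "B"],
    "Translated_Review": ["Great", np.nan, np.nan, "Bad"],
    "Sentiment": ["Positive", np.nan, "Neutral", "Negative"],
})

# Drop a row if ANY column is missing (what a plain dropna() does)
any_missing = reviews.dropna()

# Drop a row only when the chosen column is missing
no_sentiment = reviews.dropna(subset=["Sentiment"])

print(len(any_missing), len(no_sentiment))
```

Which variant is appropriate depends on whether later analyses need the review text itself or only the sentiment scores.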


Merge both dfs

In [324]:
df3 = pd.merge(df1,df2, on = 'App', indicator=True, how = 'outer')

In [325]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 83715 entries, 0 to 83714
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   App                     83715 non-null  object  
 1   Category                82217 non-null  object  
 2   Rating                  80705 non-null  float64 
 3   Reviews                 82217 non-null  object  
 4   Size                    82217 non-null  object  
 5   Installs                82217 non-null  object  
 6   Type                    82216 non-null  object  
 7   Price                   82217 non-null  object  
 8   Content Rating          82216 non-null  object  
 9   Genres                  82217 non-null  object  
 10  Last Updated            82217 non-null  object  
 11  Current Ver             82209 non-null  object  
 12  Android Ver             82214 non-null  object  
 13  Translated_Review       74103 non-null  object  
 14  Sentiment             

Checking whether the merge went well, using a randomly selected app

In [326]:
df_test1 = df1[df1['App'] == 'Coloring book moana']
df_test1

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2033,Coloring book moana,FAMILY,3.9,974,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up


In [327]:
df_test2 = df2[df2['App'] == 'Coloring book moana']
df_test2.info()
df_test2.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 28538 to 28594
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     44 non-null     object 
 1   Translated_Review       44 non-null     object 
 2   Sentiment               44 non-null     object 
 3   Sentiment_Polarity      44 non-null     float64
 4   Sentiment_Subjectivity  44 non-null     float64
dtypes: float64(2), object(3)
memory usage: 2.1+ KB


Unnamed: 0,App,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity
28538,Coloring book moana,A kid's excessive ads. The types ads allowed a...,Negative,-0.25,1.0
28539,Coloring book moana,It bad >:(,Negative,-0.725,0.833333
28540,Coloring book moana,like,Neutral,0.0,0.0
28542,Coloring book moana,I love colors inspyering,Positive,0.5,0.6
28543,Coloring book moana,I hate,Negative,-0.8,0.9


In [328]:
df_test3 = df3[df3['App'] == 'Coloring book moana']
df_test3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 88 entries, 1 to 88
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   App                     88 non-null     object  
 1   Category                88 non-null     object  
 2   Rating                  88 non-null     float64 
 3   Reviews                 88 non-null     object  
 4   Size                    88 non-null     object  
 5   Installs                88 non-null     object  
 6   Type                    88 non-null     object  
 7   Price                   88 non-null     object  
 8   Content Rating          88 non-null     object  
 9   Genres                  88 non-null     object  
 10  Last Updated            88 non-null     object  
 11  Current Ver             88 non-null     object  
 12  Android Ver             88 non-null     object  
 13  Translated_Review       88 non-null     object  
 14  Sentiment               88 n

It looks like the merge multiplied the number of rows (88 instead of 44), so let's check for duplicates

In [329]:
df_test3.duplicated()

1     False
2     False
3     False
4     False
5     False
      ...  
84     True
85     True
86     True
87     True
88     True
Length: 88, dtype: bool
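
A merge on a key that is not unique pairs every matching left row with every matching right row, which is consistent with 2 rows in df1 and 44 rows in df2 giving 88 rows here. A small sketch with invented frames (`apps` and `revs` are hypothetical names and values) reproduces the effect:

```python
import pandas as pd

apps = pd.DataFrame({"App": ["X", "X", "Y"],
                     "Category": ["ART", "FAMILY", "TOOLS"]})
revs = pd.DataFrame({"App": ["X", "X", "X"],
                     "Sentiment": ["Positive", "Negative", "Positive"]})

merged = pd.merge(apps, revs, on="App", how="outer", indicator=True)

# 'X' appears 2 times on the left and 3 times on the right -> 2 * 3 = 6 rows
n_x = (merged["App"] == "X").sum()
# 'Y' has no reviews, so it is kept once with missing review columns
n_y = (merged["App"] == "Y").sum()
# In this toy data the exact duplicates come from the repeated 'Positive' review
n_dupes = merged.duplicated().sum()

print(n_x, n_y, n_dupes)
```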

In [330]:
df3 = df3.drop_duplicates(keep = 'last')
df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51135 entries, 0 to 83714
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   App                     51135 non-null  object  
 1   Category                49693 non-null  object  
 2   Rating                  48191 non-null  float64 
 3   Reviews                 49693 non-null  object  
 4   Size                    49693 non-null  object  
 5   Installs                49693 non-null  object  
 6   Type                    49692 non-null  object  
 7   Price                   49693 non-null  object  
 8   Content Rating          49692 non-null  object  
 9   Genres                  49693 non-null  object  
 10  Last Updated            49693 non-null  object  
 11  Current Ver             49685 non-null  object  
 12  Android Ver             49690 non-null  object  
 13  Translated_Review       41856 non-null  object  
 14  Sentiment             

Looks like we got rid of the duplicates now

In [371]:
df3.sample(10)

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity,_merge
47833,Basketball FRVR - Shoot the Hoop and Slam Dunk!,FAMILY,4.5,4076.0,78000000.0,100000,Free,0,Everyone,Sports;Action & Adventure,"August 1, 2018",1.3.3,4.1 and up,I like want learn shooting.,Neutral,0.0,0.0,both
67224,Sun Rise Free Live Wallpaper,PERSONALIZATION,4.3,86481.0,8.6,10000000,Free,0,Everyone,Personalization,"January 31, 2018",4.8.3,4.0 and up,,,,,left_only
65004,DuraSpeed,TOOLS,3.8,5431.0,1.3,10000000,Free,0,Everyone,Tools,"November 13, 2017",1.5.0,6.0 and up,Slow phone micromax canvas 1,Negative,-0.3,0.4,both
27525,Bubble Shooter,GAME,4.5,148895.0,46000000.0,10000000,Free,0,Everyone,Casual,"July 17, 2018",1.20.1,4.0.3 and up,The game cool.,Negative,-0.025,0.525,both
67450,"GO Keyboard - Emoticon keyboard, Free Theme, GIF",PERSONALIZATION,4.4,2591941.0,6.2,100000000,Free,0,Everyone,Personalization,"July 20, 2018",Varies with device,Varies with device,VIP,Neutral,0.0,0.0,both
75536,Bad Piggies,FAMILY,4.3,1168959.0,66000000.0,50000000,Free,0,Everyone,Puzzle,"May 3, 2017",2.3.3,4.1 and up,The game's good road el porkador switches glit...,Negative,-0.06,0.32,both
66381,Diamond Zipper Lock Screen,PERSONALIZATION,4.3,71688.0,12000000.0,10000000,Free,0,Everyone,Personalization,"June 11, 2018",3.5,4.1 and up,Best lock screen ever,Positive,1.0,0.3,both
56972,"Cymera Camera- Photo Editor, Filter,Collage,La...",PHOTOGRAPHY,4.4,2418158.0,6.6,100000000,Free,0,Everyone,Photography,"July 12, 2018",Varies with device,Varies with device,Very useful,Positive,0.39,0.0,both
15662,Bualuang mBanking,FINANCE,4.0,48445.0,10000000.0,5000000,Free,0,Everyone,Finance,"July 16, 2018",2.6.0,4.0.3 and up,Why is that? Can I transfer numbers or Id? It'...,Positive,0.7,0.6,both
55732,Google Photos,PHOTOGRAPHY,4.5,10858538.0,28000000.0,1000000000,Free,0,Everyone,Photography,"August 6, 2018",Varies with device,Varies with device,"I installed ""Google Photos"" IOS devices Samsun...",Positive,0.175,0.325,both


It is clear by the last column that some Apps only had info in df1, and some only had info in df2. I have decided to leave it like that for now because we do not want to lose data from df1 as some analyses will only need df1.
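
One way to quantify this is to count the values of the `_merge` indicator. A minimal sketch on invented frames (`left` and `right` are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({"App": ["A", "B"], "Rating": [4.1, 3.9]})
right = pd.DataFrame({"App": ["B", "C"], "Sentiment": ["Positive", "Negative"]})

merged = pd.merge(left, right, on="App", how="outer", indicator=True)

# How many rows came from only one side, or from both
counts = merged["_merge"].value_counts()
print(counts)

# Rows usable for analyses that need app details (present in the left frame)
with_app_info = merged[merged["_merge"] != "right_only"]
```

Filtering on `_merge` at analysis time keeps the merged frame intact while still allowing each analysis to pick the rows it needs.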

Some more data cleaning is necessary. Let's look at the data for each column and see what needs to be cleaned.

In [332]:
df3.Category.unique()

array(['ART_AND_DESIGN', 'FAMILY', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMMUNICATION', 'COMICS',
       'DATING', 'TOOLS', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS',
       'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'MEDICAL',
       'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME',
       'SPORTS', 'VIDEO_PLAYERS', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY',
       'TRAVEL_AND_LOCAL', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING',
       'WEATHER', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION', '1.9', nan],
      dtype=object)

There is a strange '1.9', let's investigate that.

In [333]:
df3[df3['Category'] == '1.9']

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity,_merge
81857,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,,,,,,left_only


This looks like an error, so we will replace it with NaN

In [334]:
df3['Category'] = df3['Category'].replace('1.9', np.nan)  # np.nan makes the replacement value explicit

In [335]:
df3.Category.unique()

array(['ART_AND_DESIGN', 'FAMILY', 'AUTO_AND_VEHICLES', 'BEAUTY',
       'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMMUNICATION', 'COMICS',
       'DATING', 'TOOLS', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS',
       'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'MEDICAL',
       'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME',
       'SPORTS', 'VIDEO_PLAYERS', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY',
       'TRAVEL_AND_LOCAL', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING',
       'WEATHER', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION', nan],
      dtype=object)

Next, we look at the Rating column

In [336]:
df3.Rating.min()

1.0

In [337]:
df3.Rating.max()

19.0

Is it possible to have 19 as the maximum Rating?

In [338]:
df3.Rating.value_counts()

4.4     7381
4.5     7080
4.3     6670
4.6     6240
4.2     5262
4.1     3600
4.7     3012
4.0     2325
3.9     1746
3.8     1023
3.7      751
4.8      586
3.5      441
3.4      359
3.6      346
5.0      271
3.1      185
4.9      182
3.0      143
3.3      138
3.2      102
2.7       60
2.6       54
2.9       45
2.8       40
2.5       20
2.3       20
2.4       19
1.0       16
2.2       14
2.0       12
1.9       12
1.7        8
1.8        8
2.1        8
1.6        4
1.4        3
1.5        3
1.2        1
19.0       1
Name: Rating, dtype: int64

Yes, this number looks like an error. It appears to be the same row we saw before with '1.9' as the category. Looking closely, one can see that all entries are shifted by one column: possibly something went wrong with the Category when scraping, so every value ended up in the wrong column. I'll get rid of this row.

In [360]:
df3 = df3[df3.Rating < 5]  # note: this also drops rows rated exactly 5.0 and rows with a missing Rating
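
A comparison such as `Rating < 5` is False for a rating of exactly 5.0 and for missing ratings, so it removes more than the one malformed row. A hedged alternative, shown on made-up values, keeps everything except ratings outside the 1 to 5 range:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"App": ["a", "b", "c", "d"],
                    "Rating": [4.4, 5.0, np.nan, 19.0]})

# Strict comparison: drops 5.0, NaN and 19.0
strict = toy[toy["Rating"] < 5]

# between() is False for NaN, so missing ratings are kept explicitly
valid = toy["Rating"].between(1, 5) | toy["Rating"].isna()
lenient = toy[valid]  # drops only 19.0

print(len(strict), len(lenient))
```

Keeping rows with a missing Rating has a side effect: later integer casts on other columns would then need to tolerate missing values.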

Now let's look at the Reviews column

In [361]:
df3.Reviews.min()

1.0

In [362]:
df3.Reviews.max()

78158306.0

It looks like pandas is reading this column as text rather than as numbers, which could get tricky in some analyses. So let's convert it to float.

In [363]:
df3['Reviews'] = df3['Reviews'].astype(float) 

So when we re-run the min and max, we get numbers now!
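
`astype(float)` raises an error if any value is not a valid number. Below is a sketch of a more forgiving conversion with `pd.to_numeric`, on made-up values (the string '3.0M' imitates the misplaced entry seen earlier):

```python
import pandas as pd

raw = pd.Series(["159", "967", "3.0M"])

# errors="coerce" turns unparseable strings into NaN instead of raising
reviews_num = pd.to_numeric(raw, errors="coerce")

print(reviews_num.tolist())
```

Counting the NaN values before and after the conversion is a quick way to see how many entries could not be parsed.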

Now let's check the Size column.

In [343]:
df3.Size.min()

'1.0M'

In [344]:
df3.Size.max()

'Varies with device'

It would be useful to have a way to transform these strings into numbers so we can do some analyses in the future. For example, does size influence the decision to download the app?

In [345]:
df3['Size'] = df3['Size'].str.replace('M', '000000', regex=False)  # exact only for whole-number sizes such as '19M'

In [346]:
df3['Size'] = df3['Size'].str.replace('k', '000', regex=False)  # k = kilobytes, i.e. a factor of 1,000

In [348]:
df3['Size'] = df3['Size'].replace('Varies with device', np.nan)  # np.nan makes the replacement value explicit

In [350]:
df3['Size'] = df3['Size'].astype(float) 
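
Appending zeros works for whole numbers such as '19M', but a decimal value such as '8.7M' becomes '8.7000000', which parses as 8.7. Below is a sketch of an alternative that multiplies by the unit instead. It assumes sizes look like '19M', '8.7M', '201k' or 'Varies with device', and that 'M' means 10^6 bytes and 'k' means 10^3 bytes. The helper `size_to_bytes` is hypothetical:

```python
import numpy as np
import pandas as pd

# Assumed unit factors (decimal megabytes and kilobytes)
UNITS = {"k": 1_000, "M": 1_000_000}

def size_to_bytes(size: pd.Series) -> pd.Series:
    """Convert strings like '8.7M' or '201k' to bytes; anything else becomes NaN."""
    parts = size.str.extract(r"^(?P<value>\d+(?:\.\d+)?)(?P<unit>[kM])$")
    value = pd.to_numeric(parts["value"], errors="coerce")
    factor = parts["unit"].map(UNITS)
    return value * factor

sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])
size_bytes = size_to_bytes(sizes)
print(size_bytes.tolist())
```

Because unmatched strings become NaN, 'Varies with device' needs no separate replacement step in this variant.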

Now let's clean the Installs column

In [356]:
df3['Installs'] = df3['Installs'].str.replace('+', '', regex=False)  # '+' is a regex metacharacter, so replace it literally
df3['Installs'] = df3['Installs'].str.replace(',', '', regex=False)

In [357]:
df3['Installs'] = df3['Installs'].astype(int) 
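
The '+' character has a special meaning in regular expressions, so a literal replacement is safer with `regex=False`. Below is a sketch on made-up values that strips both symbols and converts with `pd.to_numeric`, so that a missing value does not break the cast:

```python
import numpy as np
import pandas as pd

installs = pd.Series(["10,000+", "5,000,000+", np.nan])

cleaned = (
    installs
    .str.replace("+", "", regex=False)   # literal plus sign
    .str.replace(",", "", regex=False)   # thousands separator
)

# to_numeric keeps NaN as NaN, whereas astype(int) would raise on it
installs_num = pd.to_numeric(cleaned, errors="coerce")
print(installs_num.tolist())
```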

The next column is Type, and it looks like this column is good to go for analysis

In [366]:
df3.Type.value_counts()

Free    46975
Paid      944
Name: Type, dtype: int64

Next column is Price

In [378]:
df3.Price.min()

0.0

In [379]:
df3.Price.max()

400.0

In [373]:
df3['Price'] = df3['Price'].str.replace('$', '', regex=False)  # '$' is a regex anchor, so replace it literally

In [376]:
df3['Price'] = df3['Price'].astype(float) 

In [382]:
df3.Price.value_counts().sort_index()

0.00      46975
0.99        153
1.00          2
1.20          1
1.29          1
          ...  
299.99        1
379.99        1
389.99        1
399.99       11
400.00        1
Name: Price, Length: 71, dtype: int64

Next column is content rating and it looks like it's all good:

In [384]:
df3['Content Rating'].value_counts()

Everyone           37288
Teen                6275
Mature 17+          2418
Everyone 10+        1900
Adults only 18+       37
Unrated                1
Name: Content Rating, dtype: int64

Next column is Genres:

In [394]:
df3.Genres.unique()

array(['Art & Design', 'Art & Design;Pretend Play',
       'Art & Design;Creativity', 'Auto & Vehicles', 'Beauty',
       'Books & Reference', 'Business', 'Communication', 'Comics',
       'Comics;Creativity', 'Dating', 'Tools', 'Education;Education',
       'Education', 'Education;Creativity', 'Education;Music & Video',
       'Education;Action & Adventure', 'Education;Pretend Play',
       'Education;Brain Games', 'Entertainment',
       'Entertainment;Music & Video', 'Entertainment;Brain Games',
       'Entertainment;Creativity', 'Events', 'Finance', 'Food & Drink',
       'Health & Fitness', 'Medical', 'House & Home', 'Libraries & Demo',
       'Lifestyle', 'Lifestyle;Pretend Play',
       'Adventure;Action & Adventure', 'Arcade', 'Casual', 'Card',
       'Card;Brain Games', 'Puzzle;Brain Games', 'Casual;Pretend Play',
       'Action', 'Strategy', 'Puzzle', 'Sports', 'Music', 'Word',
       'Racing', 'Casual;Creativity', 'Casual;Action & Adventure',
       'Simulation', 'Adventure'

It looks like the genre 'Education' appears twice because one of the instances has a trailing space. There is also 'Educational'. So let's fix that and have only one 'Education'.

In [391]:
df3['Genres'] = df3['Genres'].str.replace('Education ', 'Education', regex=False)
df3['Genres'] = df3['Genres'].str.replace('Educational', 'Education', regex=False)

We could also split the ones with more than one genre into two columns

In [403]:
df3[['Genre1', 'Genre2']] = df3['Genres'].str.split(';', expand=True)
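
The assignment to exactly two columns only works when no value contains more than one ';'. Below is a sketch that makes this explicit with `n=1`, on made-up values:

```python
import pandas as pd

genres = pd.Series(["Art & Design",
                    "Art & Design;Pretend Play",
                    "Education;Creativity"])

# n=1 splits at the first ';' only, so at most two pieces are produced
parts = genres.str.split(";", n=1, expand=True)
parts.columns = ["Genre1", "Genre2"]

print(parts)
```

Rows with a single genre get a missing value in `Genre2`.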

In [408]:
df3.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver,Translated_Review,Sentiment,Sentiment_Polarity,Sentiment_Subjectivity,_merge,Genre1,Genre2
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159.0,19000000.0,10000,Free,0.0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up,,,,,left_only,Art & Design,
23,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14000000.0,500000,Free,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,A kid's excessive ads. The types ads allowed a...,Negative,-0.25,1.0,both,Art & Design,Pretend Play
24,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14000000.0,500000,Free,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,It cute.,Positive,0.5,1.0,both,Art & Design,Pretend Play
25,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14000000.0,500000,Free,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,It bad >:(,Negative,-0.725,0.833333,both,Art & Design,Pretend Play
26,Coloring book moana,ART_AND_DESIGN,3.9,967.0,14000000.0,500000,Free,0.0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up,like,Neutral,0.0,0.0,both,Art & Design,Pretend Play


Now let's check the column 'Last Updated'

In [410]:
df3['Last Updated'].value_counts()

July 31, 2018        3331
August 3, 2018       2464
August 2, 2018       2199
August 1, 2018       2046
August 6, 2018       1383
                     ... 
November 16, 2015       1
March 28, 2016          1
April 17, 2014          1
April 11, 2016          1
March 23, 2014          1
Name: Last Updated, Length: 1288, dtype: int64

In [415]:
df3['Last Updated'].sample(20)

15754      May 25, 2018
64965     July 18, 2018
9703      April 7, 2018
55575    August 6, 2018
51574      June 1, 2018
3054     August 1, 2018
65871      July 9, 2018
64797     July 31, 2018
33675     July 31, 2018
44961     July 23, 2018
52169     July 20, 2018
75913       May 6, 2018
78077       May 3, 2018
32762     July 31, 2018
29784     April 9, 2018
63962     July 31, 2018
11651     July 30, 2018
1381      July 27, 2018
36897     July 30, 2018
53772     July 31, 2018
Name: Last Updated, dtype: object

all good with this column
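
The values are consistent, but they are still strings. If date arithmetic is needed later, they can be parsed with `pd.to_datetime`. Below is a sketch on a few values copied from the output above, assuming every entry follows the 'Month day, year' pattern:

```python
import pandas as pd

last_updated = pd.Series(["May 25, 2018", "July 31, 2018", "April 7, 2018"])

# %B = full month name, %d = day of month, %Y = four-digit year
dates = pd.to_datetime(last_updated, format="%B %d, %Y")

print(dates.dt.year.tolist(), dates.min())
```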

Now let's check the column 'Current Ver'

In [416]:
df3['Current Ver'].value_counts()

Varies with device    12199
1.0.6                   675
4.0.0                   674
1.0                     579
7.9.3                   535
                      ...  
17.4.11                   1
6.7.15.7                  1
1.8.7.0                   1
5.1.10                    1
0.3.4                     1
Name: Current Ver, Length: 2608, dtype: int64

In [418]:
df3['Current Ver'].sample(20)

8950                  2.0.5
48226                 37893
513                    1.37
9217                  6.4.1
23129    Varies with device
61794         10.9.8 (Play)
74808    Varies with device
15830    Varies with device
15095                 2.3.2
59773    Varies with device
80614                   1.5
4349                3.1.4.0
53077               4.6.2.0
12043                   3.0
75505                 2.3.3
24536                  4.90
63649    Varies with device
66312                   1.1
19821    Varies with device
34857                 7.9.3
Name: Current Ver, dtype: object

all good with this column!

Now let's check the column 'Android Ver'

In [419]:
df3['Android Ver'].value_counts()

Varies with device    11444
4.1 and up            10763
4.0.3 and up           7147
4.4 and up             4437
4.0 and up             3747
5.0 and up             3153
2.3 and up             2129
4.2 and up             1542
3.0 and up              676
2.3.3 and up            655
4.3 and up              563
6.0 and up              416
2.2 and up              382
2.1 and up              304
1.6 and up              152
7.0 and up              110
3.2 and up               65
2.0 and up               58
1.5 and up               53
7.1 and up               42
4.0.3 - 7.1.1            33
5.1 and up               15
3.1 and up                8
2.0.1 and up              6
4.4W and up               5
8.0 and up                4
5.0 - 8.0                 3
1.0 and up                2
7.0 - 7.1.1               1
4.1 - 7.1.1               1
5.0 - 6.0                 1
Name: Android Ver, dtype: int64

In [422]:
df3['Android Ver'].sample(20)

3605             4.2 and up
49205            4.3 and up
77114          2.3.3 and up
37555          4.0.3 and up
33038          4.0.3 and up
70754            4.2 and up
5672     Varies with device
52135            4.1 and up
79693          4.0.3 and up
1236     Varies with device
15360    Varies with device
65940    Varies with device
15612            5.0 and up
61878    Varies with device
29483            4.1 and up
73547            5.0 and up
43690            4.1 and up
72421            4.4 and up
23486            4.4 and up
74730          4.0.3 and up
Name: Android Ver, dtype: object

all good with this column!
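
The column is clean as text. If a numeric minimum version is needed later, the leading version number can be pulled out with a regular expression. Below is a sketch on values from the output above, assuming the first 'major.minor' number is the minimum supported version (the name `min_android` is hypothetical):

```python
import pandas as pd

android_ver = pd.Series(["4.0.3 and up", "Varies with device",
                         "4.4W and up", "5.0 - 8.0"])

# Capture the leading major.minor number; rows without one become NaN
min_android = android_ver.str.extract(r"^(\d+\.\d+)", expand=False).astype(float)

print(min_android.tolist())
```

Patch levels such as the '.3' in '4.0.3' are discarded in this sketch, which may or may not matter depending on the analysis.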

In [423]:
df3.to_excel("clean-data.xlsx") 