<a href="https://colab.research.google.com/github/anh56/CoderSchool/blob/master/Assignment/Week%204/Google_Play_Store_Data_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Team 2 - Google Play Store

![](https://www.brandnol.com/wp-content/uploads/2019/04/Google-Play-Store-Search.jpg)

_For more information about the dataset, read [here](https://www.kaggle.com/lava18/google-play-store-apps)._

## Your tasks
- Name your team!
- Read the source and do some quick research to understand more about the dataset and its topic
- Clean the data
- Perform Exploratory Data Analysis on the dataset
- Analyze the data more deeply and extract insights
- Visualize your analysis on Google Data Studio
- Present your work to the class and guests next Monday

## Submission Guide
- Create a GitHub repository for your project
- Upload the dataset (.csv file) and the Jupyter Notebook to your GitHub repository. In the Jupyter Notebook, **include the link to your Google Data Studio report**.
- Submit your work through this [Google Form](https://forms.gle/oxtXpGfS8JapVj3V8).

## Tips for Data Cleaning, Manipulation & Visualization
- Our tips for data cleaning, manipulation, and visualization are collected [here](https://hackmd.io/cBNV7E6TT2WMliQC-GTw1A).

_____________________________

## Some Hints for This Dataset:
- There are lots of null values. How should we handle them?
- The `Installs` and `Size` columns contain some strange values. Can you identify them?
- Values in the `Size` column use different unit suffixes: `M` and `k`. And what about the value `Varies with device`?
- The `Price` column is not stored with the right data type
- And more...
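
One possible way to spot strange values, sketched on a made-up sample rather than the real column: test every entry against the pattern you expect and look at what fails. The sample values and the pattern below are assumptions for illustration only.

```python
import pandas as pd

# Made-up sample standing in for the Installs column.
installs = pd.Series(["10,000+", "500,000+", "Free", "0"])

# Expected shape: digits and commas, optionally followed by a single '+'.
looks_numeric = installs.str.match(r"^[\d,]+\+?$")

# Anything that does not fit the pattern deserves a closer look.
odd = installs[~looks_numeric]
print(odd)
```

The same idea applies to `Size` with a pattern for a number followed by `k` or `M`.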


In [0]:
# imports and plotting setup

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
%matplotlib inline
import warnings
import re
warnings.filterwarnings('ignore')
sns.set_style("whitegrid")


In [0]:
# raw CSV file hosted on GitHub
link = "https://raw.githubusercontent.com/anh56/CoderSchool/master/Assignment/Week%204/google-play-store.csv"
# read the Play Store data from the link
psdata = pd.read_csv(link)

In [125]:
#get first rows of the dataframe
psdata.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


In [126]:
psdata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
App               10841 non-null object
Category          10841 non-null object
Rating            9367 non-null float64
Reviews           10841 non-null object
Size              10841 non-null object
Installs          10841 non-null object
Type              10840 non-null object
Price             10841 non-null object
Content Rating    10840 non-null object
Genres            10841 non-null object
Last Updated      10841 non-null object
Current Ver       10833 non-null object
Android Ver       10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [127]:
# count the null values in each column
psdata.isnull().sum()

App                  0
Category             0
Rating            1474
Reviews              0
Size                 0
Installs             0
Type                 1
Price                0
Content Rating       1
Genres               0
Last Updated         0
Current Ver          8
Android Ver          3
dtype: int64
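
The raw counts are easier to judge as a share of all rows. In the output above, `Rating` is missing in 1474 of 10841 rows, roughly 14%. A minimal sketch on a toy frame (the data is made up; only `isnull()` and `mean()` are doing the work):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for psdata; the values are invented for illustration.
toy = pd.DataFrame({
    "App": ["a", "b", "c", "d"],
    "Rating": [4.1, np.nan, 3.9, np.nan],
    "Type": ["Free", "Free", None, "Paid"],
})

# The mean of a boolean mask is the fraction of missing values per column.
null_share = toy.isnull().mean()
print(null_share)
```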

In [0]:
# drop every row that has a null in any column; Rating accounts for most of them (1474 of 10841 rows)
psdata.dropna(inplace=True)
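
If losing every unrated app turns out to be too costly, one alternative (not what this notebook does) is to drop rows only when a rarely-missing column is null. A hedged sketch on invented data, using the `subset` parameter of `dropna`:

```python
import numpy as np
import pandas as pd

# Toy frame; column names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Rating": [4.1, np.nan, 3.9],
    "Type": ["Free", "Free", np.nan],
})

# Only rows with a null in the listed columns are removed;
# the row with a missing Rating survives.
kept = toy.dropna(subset=["Type"])
print(kept)
```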

In [129]:
psdata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9360 entries, 0 to 10840
Data columns (total 13 columns):
App               9360 non-null object
Category          9360 non-null object
Rating            9360 non-null float64
Reviews           9360 non-null object
Size              9360 non-null object
Installs          9360 non-null object
Type              9360 non-null object
Price             9360 non-null object
Content Rating    9360 non-null object
Genres            9360 non-null object
Last Updated      9360 non-null object
Current Ver       9360 non-null object
Android Ver       9360 non-null object
dtypes: float64(1), object(12)
memory usage: 1023.8+ KB


In [0]:
# strip() with no argument removes surrounding whitespace; strip('') would remove nothing
psdata["App"] = psdata["App"].str.strip()

In [0]:
psdata["Category"] = psdata["Category"].str.strip()

In [132]:
# remove the $ symbol from Price (the column is still stored as strings afterwards)
psdata['Price'] = psdata['Price'].str.strip('$')
psdata['Price'].sample(10)

6627    0
6022    0
5646    0
4533    0
9587    0
7241    0
6761    0
9166    0
8261    0
8063    0
Name: Price, dtype: object
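
The output above still reports `dtype: object`, so the prices are strings. A minimal sketch of a numeric conversion, assuming stray non-numeric entries should become `NaN` (the sample values are made up):

```python
import pandas as pd

# Made-up sample; the last entry imitates a stray non-price value.
prices = pd.Series(["0", "$4.99", "$0.99", "Everyone"])

# Strip the currency symbol, then convert; errors="coerce" turns
# anything that is not a number into NaN instead of raising.
cleaned = pd.to_numeric(prices.str.lstrip("$"), errors="coerce")
print(cleaned)
```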

In [133]:
# remove the trailing + symbol from Installs (the column is still stored as strings afterwards)
psdata['Installs'] = psdata['Installs'].str.strip('+')
psdata['Installs'].sample(10)

4053      1,000,000
4031    100,000,000
3008        100,000
8815        100,000
5251            100
2519         10,000
5693         50,000
1697    100,000,000
2357        100,000
7611        500,000
Name: Installs, dtype: object
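
The values above still contain thousands separators and are stored as strings. One way to turn them into integers, sketched on a made-up sample:

```python
import pandas as pd

# Made-up sample in the same format as the Installs column.
installs = pd.Series(["10,000+", "500,000+", "100"])

# Remove the commas literally (regex=False), drop the trailing '+',
# then convert the remaining digits to integers.
as_int = (
    installs.str.replace(",", "", regex=False)
    .str.rstrip("+")
    .astype(int)
)
print(as_int)
```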

In [134]:
psdata["Size"].unique() # see the values of size

array(['19M', '14M', '8.7M', '25M', '2.8M', '5.6M', '29M', '33M', '3.1M',
       '28M', '12M', '20M', '21M', '37M', '5.5M', '17M', '39M', '31M',
       '4.2M', '23M', '6.0M', '6.1M', '4.6M', '9.2M', '5.2M', '11M',
       '24M', 'Varies with device', '9.4M', '15M', '10M', '1.2M', '26M',
       '8.0M', '7.9M', '56M', '57M', '35M', '54M', '201k', '3.6M', '5.7M',
       '8.6M', '2.4M', '27M', '2.7M', '2.5M', '7.0M', '16M', '3.4M',
       '8.9M', '3.9M', '2.9M', '38M', '32M', '5.4M', '18M', '1.1M',
       '2.2M', '4.5M', '9.8M', '52M', '9.0M', '6.7M', '30M', '2.6M',
       '7.1M', '22M', '6.4M', '3.2M', '8.2M', '4.9M', '9.5M', '5.0M',
       '5.9M', '13M', '73M', '6.8M', '3.5M', '4.0M', '2.3M', '2.1M',
       '42M', '9.1M', '55M', '23k', '7.3M', '6.5M', '1.5M', '7.5M', '51M',
       '41M', '48M', '8.5M', '46M', '8.3M', '4.3M', '4.7M', '3.3M', '40M',
       '7.8M', '8.8M', '6.6M', '5.1M', '61M', '66M', '79k', '8.4M',
       '3.7M', '118k', '44M', '695k', '1.6M', '6.2M', '53M', '1.4M',
      

In [135]:
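# '1000+' is an odd value in Size; treat it like 'Varies with device'
# match the literal string (as a regex, '+' would mean "one or more 0") and assign the result back
psdata['Size'] = psdata['Size'].replace('1000+', 'Varies with device')
psdata['Size']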

0                       19M
1                       14M
2                      8.7M
3                       25M
4                      2.8M
5                      5.6M
6                       19M
7                       29M
8                       33M
9                      3.1M
10                      28M
11                      12M
12                      20M
13                      21M
14                      37M
16                     5.5M
17                      17M
18                      39M
19                      31M
20                      14M
21                      12M
22                     4.2M
24                      23M
25                     6.0M
26                      25M
27                     6.1M
28                     4.6M
29                     4.2M
30                     9.2M
31                     5.2M
                ...        
10795                  4.0M
10796                  7.8M
10797                   46M
10799                  6.8M
10800               

In [0]:
# replace 'Varies with device' with NaN so that Size can become numeric
psdata['Size'] = psdata['Size'].replace('Varies with device', np.nan)

In [0]:
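# convert sizes to bytes: the numeric part times 10**3 for a 'k' suffix or 10**6 for an 'M' suffix
size_number = psdata['Size'].replace('[kM]$', '', regex=True).astype(float)
size_factor = psdata['Size'].str.extract(r'[\d\.]+([kM]+)', expand=False).fillna(1).replace(['k', 'M'], [10**3, 10**6]).astype(int)
psdata['Size'] = size_number * size_factor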
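
The same conversion can also be written as a small helper, which may be easier to read and test. This is only a sketch: `size_to_bytes` and the sample values are hypothetical names and data, not part of the notebook.

```python
import re

import numpy as np
import pandas as pd

# Multipliers for the two unit suffixes seen in the Size column.
_UNITS = {"k": 10**3, "M": 10**6}
_SIZE_RE = re.compile(r"^(\d+(?:\.\d+)?)([kM])$")


def size_to_bytes(text):
    """Return the size in bytes, or NaN if text is not '<number>k' or '<number>M'."""
    match = _SIZE_RE.match(str(text))
    if match is None:
        return np.nan
    number, unit = match.groups()
    return float(number) * _UNITS[unit]


sizes = pd.Series(["19M", "201k", "8.7M", "Varies with device"])
converted = sizes.map(size_to_bytes)
print(converted)
```

Anything that does not match the pattern, including `Varies with device`, ends up as `NaN`, so the separate replacement step is not needed with this approach.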

In [138]:
psdata['Size']

0        19000000.0
1        14000000.0
2         8700000.0
3        25000000.0
4         2800000.0
5         5600000.0
6        19000000.0
7        29000000.0
8        33000000.0
9         3100000.0
10       28000000.0
11       12000000.0
12       20000000.0
13       21000000.0
14       37000000.0
16        5500000.0
17       17000000.0
18       39000000.0
19       31000000.0
20       14000000.0
21       12000000.0
22        4200000.0
24       23000000.0
25        6000000.0
26       25000000.0
27        6100000.0
28        4600000.0
29        4200000.0
30        9200000.0
31        5200000.0
            ...    
10795     4000000.0
10796     7800000.0
10797    46000000.0
10799     6800000.0
10800    12000000.0
10801    19000000.0
10802    28000000.0
10803    81000000.0
10804    17000000.0
10805    15000000.0
10809    24000000.0
10810    21000000.0
10812    13000000.0
10814    31000000.0
10815     4900000.0
10817     8000000.0
10819     3600000.0
10820     8600000.0
10826           NaN


In [0]:
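# intended: fill the 'Varies with device' NaNs with the mean Size of each Category
# note: fillna aligns on the row index, while the groupby result is indexed by Category, so the NaNs stay in place
psdata['Size'].fillna(psdata.groupby('Category')['Size'].mean(), inplace= True)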
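
`Series.fillna` matches a Series argument by index label, and the per-category means are labelled with category names rather than row numbers. That would explain why the final `info()` output still lists only 7723 non-null `Size` values. One way to make the fill take effect is `transform("mean")`, which returns one value per row. A sketch on invented data:

```python
import numpy as np
import pandas as pd

# Toy frame with a non-contiguous index, like psdata after dropna().
toy = pd.DataFrame(
    {
        "Category": ["GAME", "GAME", "TOOLS", "TOOLS", "GAME"],
        "Size": [10.0, np.nan, 4.0, np.nan, 30.0],
    },
    index=[0, 1, 3, 4, 7],
)

# transform("mean") broadcasts each category's mean back onto its rows,
# so the result shares toy's index and fillna can align with it.
category_mean = toy.groupby("Category")["Size"].transform("mean")
toy["Size"] = toy["Size"].fillna(category_mean)
print(toy)
```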

In [142]:
# drop Current Ver: every app uses its own versioning scheme, so the column carries little comparable information
# note: drop() returns a new DataFrame; psdata keeps the column unless the result is assigned back
psdata.drop(columns='Current Ver')

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19000000.0,10000,Free,0,Everyone,Art & Design,"January 7, 2018",4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14000000.0,500000,Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8700000.0,5000000,Free,0,Everyone,Art & Design,"August 1, 2018",4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25000000.0,50000000,Free,0,Teen,Art & Design,"June 8, 2018",4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2800000.0,100000,Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",4.4 and up
5,Paper flowers instructions,ART_AND_DESIGN,4.4,167,5600000.0,50000,Free,0,Everyone,Art & Design,"March 26, 2017",2.3 and up
6,Smoke Effect Photo Maker - Smoke Editor,ART_AND_DESIGN,3.8,178,19000000.0,50000,Free,0,Everyone,Art & Design,"April 26, 2018",4.0.3 and up
7,Infinite Painter,ART_AND_DESIGN,4.1,36815,29000000.0,1000000,Free,0,Everyone,Art & Design,"June 14, 2018",4.2 and up
8,Garden Coloring Book,ART_AND_DESIGN,4.4,13791,33000000.0,1000000,Free,0,Everyone,Art & Design,"September 20, 2017",3.0 and up
9,Kids Paint Free - Drawing Fun,ART_AND_DESIGN,4.7,121,3100000.0,10000,Free,0,Everyone,Art & Design;Creativity,"July 3, 2018",4.0.3 and up
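
The table above is the return value of `drop`, not `psdata` itself, and the final `info()` output still lists `Current Ver`. To make the removal stick, the result has to be assigned back. A minimal sketch on a toy frame:

```python
import pandas as pd

# Toy frame; column names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "App": ["a"],
    "Current Ver": ["1.0"],
    "Android Ver": ["4.0 and up"],
})

# Without assignment the original frame is untouched.
toy.drop(columns="Current Ver")
still_there = "Current Ver" in toy.columns

# Assigning the result back removes the column for good.
toy = toy.drop(columns="Current Ver")
print(list(toy.columns))
```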


In [145]:
type(psdata['Last Updated'])

pandas.core.series.Series

In [0]:
psdata['Last Updated'] = pd.to_datetime(psdata['Last Updated'])  # parse the 'Last Updated' strings into datetime64 values
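
If the column might still contain entries that are not dates, an explicit format combined with `errors="coerce"` keeps the conversion from failing. This is a sketch under that assumption; the sample values, including the stray `"1.0.19"`, are made up.

```python
import pandas as pd

# Made-up sample; the last entry imitates a value that is not a date.
raw = pd.Series(["January 7, 2018", "June 20, 2018", "1.0.19"])

# "%B %d, %Y" matches strings like "January 7, 2018";
# values that do not fit become NaT instead of raising an error.
parsed = pd.to_datetime(raw, format="%B %d, %Y", errors="coerce")
print(parsed)
```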


In [0]:
psdata['Last Updated Year'] = psdata['Last Updated'].dt.year

In [0]:
psdata['Last Updated Month'] = psdata['Last Updated'].dt.month

In [0]:
psdata['Last Updated Day'] = psdata['Last Updated'].dt.day

In [153]:
psdata.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9360 entries, 0 to 10840
Data columns (total 16 columns):
App                   9360 non-null object
Category              9360 non-null object
Rating                9360 non-null float64
Reviews               9360 non-null object
Size                  7723 non-null float64
Installs              9360 non-null object
Type                  9360 non-null object
Price                 9360 non-null object
Content Rating        9360 non-null object
Genres                9360 non-null object
Last Updated          9360 non-null datetime64[ns]
Current Ver           9360 non-null object
Android Ver           9360 non-null object
Last Updated Year     9360 non-null int64
Last Updated Month    9360 non-null int64
Last Updated Day      9360 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(3), object(10)
memory usage: 1.5+ MB
