# Team 2 - Google Play Store

![](https://www.brandnol.com/wp-content/uploads/2019/04/Google-Play-Store-Search.jpg)

_For more information about the dataset, read [here](https://www.kaggle.com/lava18/google-play-store-apps)._

## Your tasks
- Name your team!
- Read the source and do some quick research to understand more about the dataset and its topic
- Clean the data
- Perform Exploratory Data Analysis on the dataset
- Analyze the data more deeply and extract insights
- Visualize your analysis in Google Data Studio
- Present your work to the class and guests next Monday

## Submission Guide
- Create a GitHub repository for your project
- Upload the dataset (.csv file) and the Jupyter Notebook to your GitHub repository. In the Jupyter Notebook, **include the link to your Google Data Studio report**.
- Submit your work through this [Google Form](https://forms.gle/oxtXpGfS8JapVj3V8).

## Tips for Data Cleaning, Manipulation & Visualization
- See [our tips for data cleaning, manipulation, and visualization](https://hackmd.io/cBNV7E6TT2WMliQC-GTw1A).

_____________________________

## Some Hints for This Dataset:
- There are lots of null values. How should we handle them?
- The `Installs` and `Size` columns contain some strange values. Can you identify them?
- Values in the `Size` column use different units: `M` and `k`. And what about the value `Varies with device`?
- The `Price` column is not stored in the right data type.
- And more...
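
One way to start on the hints above is to count missing values per column and list the values that refuse to parse as numbers. The sketch below runs on a tiny made-up frame; the `sample` data and the `non_numeric_values` helper are illustrative assumptions, not part of the dataset or of any library.

```python
import pandas as pd

# Tiny illustrative sample; the values mimic the dataset's formats but are made up.
sample = pd.DataFrame({
    "Rating": [4.1, None, 3.9, 19.0],
    "Installs": ["10,000+", "500+", "Free", "0"],
    "Price": ["0", "$4.99", "Everyone", "$1.20"],
})

# Count missing values per column.
null_counts = sample.isnull().sum()

def non_numeric_values(series, strip_chars="+,$"):
    """Return the distinct values that are still not numeric after removing strip_chars."""
    cleaned = series.astype(str)
    for ch in strip_chars:
        cleaned = cleaned.str.replace(ch, "", regex=False)
    parsed = pd.to_numeric(cleaned, errors="coerce")
    return sorted(series[parsed.isna()].unique())

print(null_counts)
print(non_numeric_values(sample["Installs"]))
print(non_numeric_values(sample["Price"]))
```

Running the same helper on the real `Installs`, `Reviews`, and `Price` columns should surface the entries that need special handling before any type conversion.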


In [17]:
# Start your code here!
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.dates as mdates

# Import data
data = pd.read_csv('google-play-store.csv')
data.head()

| | App | Category | Rating | Reviews | Size | Installs | Type | Price | Content Rating | Genres | Last Updated | Current Ver | Android Ver |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Photo Editor & Candy Camera & Grid & ScrapBook | ART_AND_DESIGN | 4.1 | 159 | 19M | 10,000+ | Free | 0 | Everyone | Art & Design | January 7, 2018 | 1.0.0 | 4.0.3 and up |
| 1 | Coloring book moana | ART_AND_DESIGN | 3.9 | 967 | 14M | 500,000+ | Free | 0 | Everyone | Art & Design;Pretend Play | January 15, 2018 | 2.0.0 | 4.0.3 and up |
| 2 | U Launcher Lite – FREE Live Cool Themes, Hide ... | ART_AND_DESIGN | 4.7 | 87510 | 8.7M | 5,000,000+ | Free | 0 | Everyone | Art & Design | August 1, 2018 | 1.2.4 | 4.0.3 and up |
| 3 | Sketch - Draw & Paint | ART_AND_DESIGN | 4.5 | 215644 | 25M | 50,000,000+ | Free | 0 | Teen | Art & Design | June 8, 2018 | Varies with device | 4.2 and up |
| 4 | Pixel Draw - Number Art Coloring Book | ART_AND_DESIGN | 4.3 | 967 | 2.8M | 100,000+ | Free | 0 | Everyone | Art & Design;Creativity | June 20, 2018 | 1.1 | 4.4 and up |


In [18]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB


In [19]:
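# Preprocess the Installs column
print(data['Installs'].unique())
# 'Free' is not a number; set it to '0' for now so the whole column can be parsed
bad_idx = data[data['Installs'] == 'Free'].index
data.loc[bad_idx, 'Installs'] = '0'
# Strip '+' and ',' and convert to int
data['Installs'] = data['Installs'].apply(lambda x: int(x.replace('+', '').replace(',', '')))
# Replace the placeholder with the column mean (the mean is computed with the placeholder 0 included)
data.loc[bad_idx, 'Installs'] = data['Installs'].mean()
print(data['Installs'].unique())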

['10,000+' '500,000+' '5,000,000+' '50,000,000+' '100,000+' '50,000+'
 '1,000,000+' '10,000,000+' '5,000+' '100,000,000+' '1,000,000,000+'
 '1,000+' '500,000,000+' '50+' '100+' '500+' '10+' '1+' '5+' '0+' '0'
 'Free']
[1.00000000e+04 5.00000000e+05 5.00000000e+06 5.00000000e+07
 1.00000000e+05 5.00000000e+04 1.00000000e+06 1.00000000e+07
 5.00000000e+03 1.00000000e+08 1.00000000e+09 1.00000000e+03
 5.00000000e+08 5.00000000e+01 1.00000000e+02 5.00000000e+02
 1.00000000e+01 1.00000000e+00 5.00000000e+00 0.00000000e+00
 1.54629124e+07]
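
The cell above fills the `Free` entry with the column mean, computed while that entry is temporarily counted as 0. As an alternative sketch, assuming you would rather treat unparseable entries as missing and impute with the median (which is less sensitive to the very large install counts), something like the following could work. The `parse_installs` helper and the sample values are hypothetical.

```python
import pandas as pd

def parse_installs(series):
    """Strip '+' and ',' and convert to numbers; unparseable values become NaN."""
    cleaned = (
        series.str.replace("+", "", regex=False)
              .str.replace(",", "", regex=False)
    )
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up values in the same format as the Installs column.
installs = pd.Series(["10,000+", "500+", "Free", "0", "1,000,000+"])

parsed = parse_installs(installs)          # 'Free' becomes NaN
filled = parsed.fillna(parsed.median())    # median ignores NaN by default

print(filled)
```

Whether mean, median, or dropping the row is the right choice depends on the analysis; the point is only that the bad value is excluded from the statistic used to replace it.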


In [20]:
# Preprocess the Rating column
print(data['Rating'].unique())
# Fill missing ratings with the mean of the non-missing ratings (mean() skips NaN)
data['Rating'] = data['Rating'].fillna(data['Rating'].mean())
print(data['Rating'].unique())

[ 4.1  3.9  4.7  4.5  4.3  4.4  3.8  4.2  4.6  3.2  4.   nan  4.8  4.9
  3.6  3.7  3.3  3.4  3.5  3.1  5.   2.6  3.   1.9  2.5  2.8  2.7  1.
  2.9  2.3  2.2  1.7  2.   1.8  2.4  1.6  2.1  1.4  1.5  1.2 19. ]
[ 4.1         3.9         4.7         4.5         4.3         4.4
  3.8         4.2         4.6         3.2         4.          4.19333832
  4.8         4.9         3.6         3.7         3.3         3.4
  3.5         3.1         5.          2.6         3.          1.9
  2.5         2.8         2.7         1.          2.9         2.3
  2.2         1.7         2.          1.8         2.4         1.6
  2.1         1.4         1.5         1.2        19.        ]
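
The unique values above still include `19.`, which lies outside the 1 to 5 rating scale and also feeds into the mean used for imputation. A minimal sketch of one way to guard against that, assuming out-of-range ratings should be treated as missing (the sample series is made up):

```python
import numpy as np
import pandas as pd

# Made-up ratings, including a missing value and an out-of-range one.
ratings = pd.Series([4.1, np.nan, 3.9, 19.0, 4.5])

# Keep only values inside the valid 1-5 range; everything else becomes NaN.
valid = ratings.where(ratings.between(1, 5))

# Impute with the mean of the valid ratings only.
imputed = valid.fillna(valid.mean())

print(imputed)
```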


In [21]:
# Preprocess the Reviews column
print(data['Reviews'].unique())
# One entry is written as '3.0M'; spell it out in full before converting to int
data.loc[data['Reviews'] == '3.0M', 'Reviews'] = '3000000'
data['Reviews'] = data['Reviews'].astype(int)
print(data['Reviews'].unique())

['159' '967' '87510' ... '603' '1195' '398307']
[   159    967  87510 ...    603   1195 398307]


In [22]:
# Preprocess the Size column (convert everything to kilobytes)
print(data.Size.head())
def process_size(x):
    # Replace ',' with '.' so the numeric part parses as a float
    x = x.replace(',', '.')
    if x[-1] in ('M', '+'):
        # Megabytes to kilobytes
        x = float(x[:-1]) * 1024
    elif x[-1] == 'k':
        # Already in kilobytes
        x = float(x[:-1])
    return x
# Convert every size except 'Varies with device'
known = data['Size'] != 'Varies with device'
data.loc[known, 'Size'] = data.loc[known, 'Size'].apply(process_size)
# Replace 'Varies with device' with the mean of the known sizes
data.loc[~known, 'Size'] = data.loc[known, 'Size'].mean()
print(data.Size.head())

0     19M
1     14M
2    8.7M
3     25M
4    2.8M
Name: Size, dtype: object
0     19456
1     14336
2    8908.8
3     25600
4    2867.2
Name: Size, dtype: object
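
Note that the column above still has dtype `object` after the conversion. As a hedged alternative, the sketch below parses sizes with a regular expression so that the result is a float column and anything unrecognized (including `Varies with device`) becomes `NaN`. The `size_to_kb` helper and the sample values are hypothetical.

```python
import pandas as pd

def size_to_kb(series):
    """Convert strings like '19M' or '201k' to kilobytes; anything else becomes NaN."""
    # Group 0 is the number, group 1 is the unit suffix.
    parts = series.str.extract(r"^([\d.]+)([Mk])$")
    number = pd.to_numeric(parts[0], errors="coerce")
    factor = parts[1].map({"M": 1024.0, "k": 1.0})
    return number * factor

# Made-up values in the same format as the Size column.
sizes = pd.Series(["19M", "8.7M", "201k", "Varies with device"])
size_kb = size_to_kb(sizes)

print(size_kb)
```

Leaving the unknown sizes as `NaN` keeps the choice of imputation (mean, median, per-category value, or none) as a separate, explicit step.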


In [23]:
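# Preprocess the Price column
print(data.Price.unique())
def process_price(x):
    # Drop the leading '$' and convert to float; values that fail to parse
    # this way (such as '0' or 'Everyone') become 0
    try:
        return float(x[1:])
    except ValueError:
        return 0
data.Price = data.Price.apply(process_price)
print(data.Price.unique())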


['0' '$4.99' '$3.99' '$6.99' '$1.49' '$2.99' '$7.99' '$5.99' '$3.49'
 '$1.99' '$9.99' '$7.49' '$0.99' '$9.00' '$5.49' '$10.00' '$24.99'
 '$11.99' '$79.99' '$16.99' '$14.99' '$1.00' '$29.99' '$12.99' '$2.49'
 '$10.99' '$1.50' '$19.99' '$15.99' '$33.99' '$74.99' '$39.99' '$3.95'
 '$4.49' '$1.70' '$8.99' '$2.00' '$3.88' '$25.99' '$399.99' '$17.99'
 '$400.00' '$3.02' '$1.76' '$4.84' '$4.77' '$1.61' '$2.50' '$1.59' '$6.49'
 '$1.29' '$5.00' '$13.99' '$299.99' '$379.99' '$37.99' '$18.99' '$389.99'
 '$19.90' '$8.49' '$1.75' '$14.00' '$4.85' '$46.99' '$109.99' '$154.99'
 '$3.08' '$2.59' '$4.80' '$1.96' '$19.40' '$3.90' '$4.59' '$15.46' '$3.04'
 '$4.29' '$2.60' '$3.28' '$4.60' '$28.99' '$2.95' '$2.90' '$1.97'
 '$200.00' '$89.99' '$2.56' '$30.99' '$3.61' '$394.99' '$1.26' 'Everyone'
 '$1.20' '$1.04']
[  0.     4.99   3.99   6.99   1.49   2.99   7.99   5.99   3.49   1.99
   9.99   7.49   0.99   9.     5.49  10.    24.99  11.99  79.99  16.99
  14.99   1.    29.99  12.99   2.49  10.99   1.5   19.99 

In [24]:
# Preprocess the Last Updated column
print(data['Last Updated'].unique())
# Drop row 10472 before parsing the dates
data = data.drop(10472)
data['Last Updated'] = pd.to_datetime(data['Last Updated'], format='%B %d, %Y')
# Convert to Matplotlib date numbers (days since the Matplotlib epoch)
data['Last Updated'] = data['Last Updated'].apply(mdates.date2num)
print(data['Last Updated'].unique())

['January 7, 2018' 'January 15, 2018' 'August 1, 2018' ...
 'January 20, 2014' 'February 16, 2014' 'March 23, 2014']
[736701. 736709. 736907. ... 735253. 735280. 735315.]
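
The cell above removes the problematic row by its hard-coded index. A minimal sketch of how such rows could instead be located automatically, assuming any value that does not match the `%B %d, %Y` format should be dropped (the `frame` sample is made up):

```python
import pandas as pd

# Made-up sample with one value that is not a date.
frame = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Last Updated": ["January 7, 2018", "1.0.19", "March 23, 2014"],
})

# Values that do not match the format become NaT instead of raising an error.
parsed = pd.to_datetime(frame["Last Updated"], format="%B %d, %Y", errors="coerce")

# Index labels of the rows whose date could not be parsed.
bad_rows = frame.index[parsed.isna()]

# Drop those rows and keep the parsed datetimes for the rest.
clean = frame.drop(bad_rows).assign(**{"Last Updated": parsed.drop(bad_rows)})

print(list(bad_rows))
print(clean)
```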


In [25]:
# Save the data for visualization in Tableau
data.to_excel('GG_output.xlsx')