In [1]:
# This is the Final Project Tutorial File
# We will scrape data from https://www.kaggle.com/lava18/google-play-store-apps
# This project is hosted at https://github.com/clabbott/app-data-analysis

In [2]:
#import libraries
import pandas as pd
import datetime
import numpy as np

In [3]:
df = pd.read_csv(r"googleplaystore.csv")
# Note: astype returns a new DataFrame. The result is only displayed here
# and is not assigned back to df, so df keeps the dtypes inferred by read_csv.
df.astype({'App':'str', 'Category':'str', 'Rating':'float32',
           'Reviews':'str', 'Size':'str', 'Installs':'str',
           'Type':'str', 'Price':'str', 'Content Rating':'str',
           'Genres':'str', 'Last Updated':'str', 'Current Ver':'str',
           'Android Ver':'str'})
# Do data cleaning here as detailed below

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10836,Sya9a Maroc - FR,FAMILY,4.5,38,53M,"5,000+",Free,0,Everyone,Education,"July 25, 2017",1.48,4.1 and up
10837,Fr. Mike Schmitz Audio Teachings,FAMILY,5.0,4,3.6M,100+,Free,0,Everyone,Education,"July 6, 2018",1.0,4.1 and up
10838,Parkinson Exercices FR,MEDICAL,,3,9.5M,"1,000+",Free,0,Everyone,Medical,"January 20, 2017",1.0,2.2 and up
10839,The SCP Foundation DB fr nn5n,BOOKS_AND_REFERENCE,4.5,114,Varies with device,"1,000+",Free,0,Mature 17+,Books & Reference,"January 19, 2015",Varies with device,Varies with device


The table we are scraping from has 13 columns. Each column needs to be processed differently, and together they can be reduced and cleaned into a more manageable format.

The App column is the name of the app and must be interpreted as a String. The names should all be unique. We will be dropping duplicate app names and keeping the apps that have more reviews.

The Category column is the general category of the app and must be interpreted as a String. Many of the apps will have the same category value. 

The Rating column is the average user rating for the app, between 1 and 5. A few rows contain bad data that falls outside these bounds, and those outliers must be dropped. This value is a float.
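
As a minimal sketch of the bounds check described above, out-of-range ratings can be filtered with a boolean mask. The toy frame and the name `valid_rating` are illustrative assumptions, and keeping rows with a missing rating is also an assumption rather than something the project states.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; 19.0 is an out-of-range rating
ratings = pd.DataFrame({'App': ['a', 'b', 'c'],
                        'Rating': [4.1, 19.0, np.nan]})

# Keep ratings within [1, 5]; missing ratings are kept here (an assumption)
valid_rating = ratings['Rating'].between(1, 5) | ratings['Rating'].isna()
ratings = ratings[valid_rating]
```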

The Reviews column is the number of user reviews for the app. This value is an integer.

The Size column is the size of the app, that is, how much space it takes up on a device. The raw values carry units (kilobytes or megabytes), so we standardize them to a single numeric value in kilobytes so that they can be compared with each other.

The Installs column is the number of devices that have installed the app. The raw data gives lower bounds such as "10,000+", so we convert each lower bound to an integer.
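
One way this conversion might look is sketched below, using sample values from the table above. The variable names are hypothetical, and this is not necessarily the project's own cleaning code.

```python
import pandas as pd

# Sample raw values in the same format as the Installs column
installs = pd.Series(['10,000+', '500,000+', '100+'])

# Strip the thousands separators and the trailing '+', then convert to integers
installs_min = pd.to_numeric(
    installs.str.replace(',', '', regex=False)
            .str.replace('+', '', regex=False)
)
```

Passing `regex=False` matters here because `+` is a special character in regular expressions.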

The Type column is whether it is a free app or not. This is a string value that we will interpret as a boolean value with True for Free and False for Not Free.
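
A short sketch of that interpretation follows. The name `is_free` is hypothetical, and the sketch assumes the only value meaning "free" is the literal string "Free". Note that a missing Type would also map to False under this comparison.

```python
import pandas as pd

# Sample values; 'Paid' stands in for any non-free label
app_type = pd.Series(['Free', 'Paid', 'Free'])

# True for Free, False for anything else
is_free = app_type == 'Free'
```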

The Price column is the price of the app. Once we remove the dollar sign from the price of some of these values, this is interpreted as a float value. 
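
The dollar-sign removal can be sketched as follows. The sample prices and the name `price_usd` are assumptions made for illustration.

```python
import pandas as pd

# Sample raw values; free apps are listed as '0', paid apps as '$x.xx'
price = pd.Series(['0', '$4.99', '$0.99'])

# Remove the literal dollar sign, then convert to float
price_usd = price.str.replace('$', '', regex=False).astype(float)
```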

The Content Rating column is a string indicating which audience the app is intended for. We had to collapse its values into a few categories to be able to analyze the data. The categories we chose were "Everyone", "Teen", and "Adult".
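
The grouping could be expressed as a lookup table, as in the sketch below. Only "Everyone", "Teen", and "Mature 17+" appear in the preview above; the other raw labels and the assignment of each label to a group are assumptions, not the project's confirmed mapping.

```python
import pandas as pd

# Hypothetical mapping from raw labels to the three chosen groups
content_rating_groups = {
    'Everyone': 'Everyone',
    'Everyone 10+': 'Everyone',   # assumed label and assignment
    'Teen': 'Teen',
    'Mature 17+': 'Adult',
    'Adults only 18+': 'Adult',   # assumed label and assignment
}

content_rating = pd.Series(['Everyone', 'Teen', 'Mature 17+'])

# Labels missing from the mapping become NaN, which makes gaps easy to spot
grouped = content_rating.map(content_rating_groups)
```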

The Genres column is a succinct, general category for the app's content. Most apps fit into one of several major categories, but a number of apps have more specific categories. We placed those apps in one of the major categories by scanning for keywords that give clues to their categorical placement. We formatted this column as strings.
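
A keyword scan of this kind might look like the sketch below. The keyword table and the function name `major_genre` are hypothetical and chosen only for illustration; the project's actual keywords are not shown here.

```python
import pandas as pd

# Sample raw values, including a compound genre from the preview above
genres = pd.Series(['Art & Design', 'Art & Design;Pretend Play', 'Education'])

# Hypothetical keyword -> major genre table
keyword_to_genre = {'Art': 'Art & Design', 'Education': 'Education'}

def major_genre(raw):
    """Return the first major genre whose keyword occurs in the raw string."""
    for keyword, genre in keyword_to_genre.items():
        if keyword in raw:
            return genre
    return raw  # leave unmatched genres unchanged

major = genres.apply(major_genre)
```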

The Last Updated column contains the date on which the app was last updated. We cleaned this data by parsing it with the datetime library so that it is easier to sort.
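
A minimal sketch of the parsing step is shown below, assuming every value follows the "Month D, YYYY" pattern seen in the preview and that month names are parsed under an English locale. The variable names are hypothetical.

```python
import datetime
import pandas as pd

# Sample raw values in the same format as the Last Updated column
last_updated = pd.Series(['January 7, 2018', 'August 1, 2018'])

# %B is the full month name, %d the day of the month, %Y the four-digit year
parsed = last_updated.apply(
    lambda s: datetime.datetime.strptime(s, '%B %d, %Y')
)
```

Once parsed, the values sort chronologically rather than alphabetically by month name.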

The Current Ver column is a string detailing the current version of the app available on the Play Store at the time of scraping.

The Android Ver column is a string detailing the minimum Android version required for the app to work on a device.


Notably, the scraper of the dataset noted that the row for "Life Made WI-Fi Touchscreen Photo Frame" was formatted incorrectly, with its values shifted into the wrong columns. We drop that row before cleaning the data.

In [4]:
#Dropping Life Made WI-Fi Touchscreen Photo Frame which is malformed
df = df.drop([10472])

In [5]:
#Formatting App (Name)

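#Dropping Duplicates, Keep app with most reviews
#Reviews is still stored as strings at this point, so convert it first;
#otherwise the sort is lexicographic rather than numeric
df['Reviews'] = pd.to_numeric(df['Reviews'])
df = df.sort_values('Reviews', ascending=False)
df = df.drop_duplicates(subset='App', keep='first') #1181 Dropped Apps
df = df.sort_index() #Sort back by index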

In [6]:
#Formatting Reviews

#Converting to Numeric
df['Reviews'] = pd.to_numeric(df['Reviews'])

In [7]:
#Formatting Size

#Replacing "Varies with device" with NaN values
#(The malformed Life Made WI-Fi Touchscreen Photo Frame row, with a size of 1000+, was already dropped above)
#This makes the column usable, but biases it against apps whose size varies with device.
df['Size'] = df['Size'].replace("Varies with device", np.nan)

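#Converts Size from a string to a numeric value in kB
#Values that end in M (MB) are converted to kB; unrecognized suffixes become NaN
def data_string_to_int(data_str):
    multipliers = {'k': 1, 'M': 1000}
    suffix = data_str[-1:]
    #Check the suffix before parsing so malformed strings do not raise ValueError
    if suffix not in multipliers:
        return np.nan
    return float(data_str[:-1]) * multipliers[suffix]


df['Size'] = df['Size'].apply(lambda x: data_string_to_int(x) if pd.notnull(x) else x)
df['Size'] = pd.to_numeric(df['Size'])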

In [8]:
df.dtypes

App                object
Category           object
Rating            float64
Reviews             int64
Size              float64
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object