# Analyzing Google Play Store Dataset

This notebook goes over the basics of working with CSV files in Jupyter Notebooks. 
As an example, we are using a dataset of [Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps). The goal is to get insights into how the most popular categories of apps on Google Play.

In [8]:
import csv

# use some IPython magic to set floats to 2 decimal points
%precision 2  

with open('datasets/googleplaystore.csv', encoding="utf8") as csvfile:
    gps = list(csv.DictReader(csvfile)) # save dataset into a list of dictionaries (where each dictionary represents one app)

print(len(gps))
gps[0] # print the first entry to look over the structure

10841


OrderedDict([('App', 'Photo Editor & Candy Camera & Grid & ScrapBook'),
             ('Category', 'ART_AND_DESIGN'),
             ('Rating', '4.1'),
             ('Reviews', '159'),
             ('Size', '19M'),
             ('Installs', '10,000+'),
             ('Type', 'Free'),
             ('Price', '0'),
             ('Content Rating', 'Everyone'),
             ('Genres', 'Art & Design'),
             ('Last Updated', 'January 7, 2018'),
             ('Current Ver', '1.0.0'),
             ('Android Ver', '4.0.3 and up')])

In [18]:
gps[0].keys()

odict_keys(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'])

I am interested in two parameters: a **number of installs** and how are they related to the **app category** (genre). Let's first overview the number of installs. 

The values in the "Installs" column however contain "+" signs at the end of the number, some numbers are formatted with a comma, and some have wrong values ("Free"). Let's quickly format the value in a way we could process them later.

In [13]:
def clean_values(value):
    return value.replace("+", "").replace(",", "").replace("Free", "0")

print("Average num of downloads")
print(sum(int(clean_values(dic["Installs"])) for dic in gps) / len(gps))

print("\nMax")
print(max(int(clean_values(dic["Installs"])) for dic in gps))

print("\nMin")
print(min(int(clean_values(dic["Installs"])) for dic in gps))

Average num of downloads
15462912.414629648

Max
1000000000

Min
0


In [10]:
# Create a set for all the genres

genres = set(dic["Genres"] for dic in gps)

print(len(genres))

120


In [19]:
# Group apps by genres and get the average num of downloads by genre
installs_by_genre = []

for genre in genres:
    sum_install = 0
    apps_by_genre_count = 0
    for dic in gps:
        if dic["Genres"] == genre:
            sum_install += int(clean_values(dic["Installs"]))
            apps_by_genre_count =+1
    installs_by_genre.append((genre, sum_install / apps_by_genre_count)) # append tuple (genre, avg_by_genre)

# sort by avg_by_genre (at index 1) and reverse to get a descending-ordered list 
installs_by_genre.sort(key=lambda x : x[1])
installs_by_genre.reverse()

# let's see top 10 categories
top10 = installs_by_genre[:10]
top10

[('Communication', 32647276251.00),
 ('Productivity', 14176091369.00),
 ('Social', 14069867902.00),
 ('Tools', 11442771915.00),
 ('Arcade', 10727129155.00),
 ('Photography', 10088247655.00),
 ('Casual', 9662830740.00),
 ('Action', 9342039190.00),
 ('News & Magazines', 7496317760.00),
 ('Travel & Local', 6868787146.00)]