# Data analysis of mobile application stores

This project applies data science techniques to a data set collected from Google Play, the application store for Android mobile devices.
The main goal is to generate insights that can guide application developers in improving the market performance of their apps on the store and increasing their profitability.

This work uses a data set that contains data about apps from the Google Play Store (Android devices). Let's get started by opening the file and extracting the data.

In [1]:
from csv import reader

# utf8 is assumed to be the file's encoding; the with block closes the file automatically
with open('googleplaystore.csv', encoding='utf8', newline='') as play_store_file:
    play_store_dataset = list(reader(play_store_file))

Now we have a list of lists containing all the rows extracted from the .csv file, with each row stored as a list of column values.

Since we are working with raw data in text form, we will often need to print either the whole data set or a slice of it to see what we are doing. Writing that code every time is tedious, so we define a function below to simplify the task:

In [2]:
def explore_data(dataset, start=0, end=0, rows_and_columns=False):
    if end == 0:
        end = len(dataset)
    
    dataset_slice = dataset[start:end]    
    
    for row in dataset_slice:
        print(row)
        print('\n') # prints two blank lines after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

The function above takes the data set and the start and end indexes of the slice we want to print. The last argument controls whether the shape (number of rows and columns) of the data set is printed as well. The next code snippet uses the `explore_data()` function.

In [3]:
explore_data(play_store_dataset, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


As we can see, the data set has 10,842 rows (including the header row) and 13 columns. At this point, we start the data cleansing process. In the discussion section of the data set page on Kaggle, people warned about a row with a missing data point (column). According to that conversation, the problematic row is at index 10,473. Let us print the row.

In [4]:
print(play_store_dataset[10473])  # the Category data point is missing in this row
print('\n')
print('Row #10,473 data point count: ' + str(len(play_store_dataset[10473])))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Row #10,473 data point count: 12


Analyzing the row, we see that it lacks the Category value, so it has only 12 data points instead of 13. Since we cannot determine the missing value to fill the gap, we are going to delete the whole row. This does no harm to the data set, since we still have more than ten thousand other rows.

In [5]:
del play_store_dataset[10473]
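
Rather than relying on an index reported by other users, a row like this could also be located programmatically by comparing each row's length with the header's. The snippet below is a minimal sketch of that idea; the helper name `find_malformed_rows` and the toy rows are assumptions made for illustration and are not part of the original notebook.

```python
def find_malformed_rows(rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(rows[0])  # the first row is assumed to be the header
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Toy data standing in for the real file: the last row lacks one column.
sample_rows = [
    ['App', 'Category', 'Rating'],
    ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5'],
    ['Broken app', '1.9'],
]

for index, row in find_malformed_rows(sample_rows):
    print(index, len(row), row)
```

Applied to the full data set before the deletion, a check like this would be expected to flag the row discussed above.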

So we end up with 10,841 rows in the data set (still including the header row). We print this number separately as a baseline, so that we can keep track of the transformations we apply to the data set.

In [6]:
print(len(play_store_dataset))

10841


Before we move on, let's split the data set and store the header apart from the data itself, for simplicity.

In [7]:
play_store_header = play_store_dataset[0]
play_store_dataset = play_store_dataset[1:]

At this stage, we will continue the process of data cleansing. Now, let's look for duplicate entries in our data set. We will carry out this task by running the code below:

In [8]:
play_store_duplicate_rows = []
play_store_unique_rows = []

for row in play_store_dataset:
    app_name = row[0]
    if app_name in play_store_unique_rows:
        play_store_duplicate_rows.append(app_name)
    else:
        play_store_unique_rows.append(app_name)
print("Number of duplicate entries: " + str(len(play_store_duplicate_rows)))

Number of duplicate entries: 1181
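
The loop above tests membership in a list, which rescans the list for every row. As a hedged alternative, the same count can be obtained with a `set`, whose membership test takes roughly constant time. The function name `count_duplicate_names` and the toy rows are illustrative assumptions.

```python
def count_duplicate_names(rows):
    """Count rows whose app name (column 0) has already been seen."""
    seen = set()
    duplicates = 0
    for row in rows:
        name = row[0]
        if name in seen:
            duplicates += 1
        else:
            seen.add(name)
    return duplicates

# Toy rows with only the name column filled in.
sample_rows = [['Instagram'], ['Sketch'], ['Instagram'], ['Instagram']]
print(count_duplicate_names(sample_rows))
```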


So the data set has 1,181 rows that bring redundant information. For instance, Instagram has 4 entries in the data set, as we can see below:

In [9]:
for row in play_store_dataset:
    app_name = row[0]
    if app_name == 'Instagram':
        print(row)
        print('\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




Now that we have determined the number of duplicate entries in our data set, we must delete them. However, we won't do that randomly. It is important to stick to a criterion for deleting these entries.

Among the duplicate rows, we are going to keep, for each unique app, only the one with the greatest number of reviews. The reasoning is that the higher the review count, the more recent that row should be.

The code below finds the highest number of reviews recorded for each app:

In [10]:
reviews_max = {}
for row in play_store_dataset:
    name = row[0]
    n_reviews = float(row[3])
    if name not in reviews_max:
        reviews_max[name] = n_reviews
    elif reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

Now, let's finally remove the duplicate rows. Instead of overwriting `play_store_dataset`, we create a new list called `android_clean`, to which we append one row for each app according to the criterion defined above. The `already_added` list ensures that an app is added only once even when several of its rows share the same maximum review count. The code below takes care of this:

In [11]:
android_clean = []
already_added = []
for row in play_store_dataset:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
print(len(android_clean))

9659
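
The two passes above can also be folded into one by storing, for each app name, the best row seen so far. This is only a sketch under assumptions: the helper name `keep_most_reviewed` and the four-column toy rows are hypothetical, and the result is ordered by each app's first appearance, which may differ from the order of `android_clean`.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when review counts are tied.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows: name, category, rating, reviews.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Sketch', 'ART_AND_DESIGN', '4.5', '215644'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
for row in keep_most_reviewed(sample_rows):
    print(row)
```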


In the next step, we finish the data cleansing by removing the entries for non-English apps, because we are interested only in apps aimed at an English-speaking audience. Below we define a function that checks whether a string consists predominantly of characters from the English alphabet: it counts the characters whose code point is above 127 (outside the ASCII range) and rejects the string once more than three are found. Tolerating up to three such characters keeps names that contain a symbol like ™ or an emoji.

In [12]:
def is_an_english_word(a_string):
    non_english_char_count = 0
    for char in a_string:
        if ord(char) > 127:
            non_english_char_count += 1
            if non_english_char_count > 3:
                return False
    return True

# test cases
print(is_an_english_word('Instagram'))
print(is_an_english_word('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_an_english_word('Docs To Go™ Free Office Suite'))
print(is_an_english_word('Instachat 😜'))

True
False
True
True
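
To make the threshold easier to reason about, it can help to look at how many non-ASCII characters a given name actually contains. The sketch below uses a hypothetical helper, `count_non_ascii`, which applies the same `ord(char) > 127` test as the function above but returns the count instead of a boolean.

```python
def count_non_ascii(a_string):
    """Count the characters whose code point lies outside the ASCII range."""
    return sum(1 for char in a_string if ord(char) > 127)

names = ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']
for name in names:
    print(name, '->', count_non_ascii(name))

# Code points of a plain letter, the trademark sign, and an emoji.
print(ord('a'), ord('™'), ord('😜'))
```

A name passes the filter above as long as this count does not exceed three.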


Below is the code that actually removes the non-English apps. Note that we create another list to store the cleaning result.

In [13]:
only_english_apps = []
for row in android_clean:
    app_name = row[0]
    if is_an_english_word(app_name):
        only_english_apps.append(row)
print(len(only_english_apps))

9614


The last criterion we have to apply is to keep only the free apps. The code below creates a new list of lists containing only free apps:

In [14]:
free_only_english_apps = []
for row in only_english_apps:
    app_type = row[play_store_header.index('Type')]
    if app_type == 'Free':
        free_only_english_apps.append(row)
print(len(free_only_english_apps))

8863
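
Filters like the one above follow the same pattern: look up a column by name, then keep the rows that match a value. A small sketch of a reusable version is shown below; `filter_rows` and the toy header and rows are assumptions made for illustration.

```python
def filter_rows(rows, header, column, value):
    """Return the rows whose entry in the named column equals value."""
    index = header.index(column)
    return [row for row in rows if row[index] == value]

# Toy header and rows standing in for the real data set.
sample_header = ['App', 'Type', 'Price']
sample_rows = [
    ['Sketch', 'Free', '0'],
    ['Some paid app', 'Paid', '$4.99'],
    ['Instagram', 'Free', '0'],
]
print(len(filter_rows(sample_rows, sample_header, 'Type', 'Free')))
```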


At this point, we have finished the data cleansing and are ready to begin the data analysis. Here we are interested in investigating which aspects of an app weigh most heavily in its success.

As a first step in this process, we can determine which app genres, taken here from the `Category` column, are the most common in the market. To do so, we can build a histogram (also known as a frequency table) using Python's dictionary data structure: each genre is a key and the number of apps in it is the corresponding value. The code below builds this table and prints the largest count.

In [15]:
genre_histogram = {}
genre_index = play_store_header.index('Category')
for row in free_only_english_apps:
    genre = row[genre_index]
    if genre in genre_histogram:
        genre_histogram[genre] += 1
    else:
        genre_histogram[genre] = 1
print(genre_histogram)
print(max(genre_histogram.values()))

{'ART_AND_DESIGN': 57, 'AUTO_AND_VEHICLES': 82, 'BEAUTY': 53, 'BOOKS_AND_REFERENCE': 190, 'BUSINESS': 407, 'COMICS': 55, 'COMMUNICATION': 287, 'DATING': 165, 'EDUCATION': 103, 'ENTERTAINMENT': 85, 'EVENTS': 63, 'FINANCE': 328, 'FOOD_AND_DRINK': 110, 'HEALTH_AND_FITNESS': 273, 'HOUSE_AND_HOME': 73, 'LIBRARIES_AND_DEMO': 83, 'LIFESTYLE': 346, 'GAME': 862, 'FAMILY': 1675, 'MEDICAL': 313, 'SOCIAL': 236, 'SHOPPING': 199, 'PHOTOGRAPHY': 261, 'SPORTS': 301, 'TRAVEL_AND_LOCAL': 207, 'TOOLS': 750, 'PERSONALIZATION': 294, 'PRODUCTIVITY': 345, 'PARENTING': 58, 'WEATHER': 71, 'VIDEO_PLAYERS': 159, 'NEWS_AND_MAGAZINES': 248, 'MAPS_AND_NAVIGATION': 124}
1675


To simplify our work, we can write a function that builds a generic frequency table for any column and returns each value's share of the rows as a percentage:

In [16]:
def freq_table(dataset, index):
    table = {}
    n_elements = 0
    for row in dataset:
        n_elements += 1
        key = row[index]
        if key in table:
            table[key] += 1
        else:
            table[key] = 1
    percentage_table = {}
    for key in table:
        percentage_table[key] = table[key] / n_elements * 100.0
    return percentage_table
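
The same percentages can be computed with `collections.Counter` from the standard library, which handles the counting step. The sketch below is an alternative, not the notebook's own method; the name `freq_table_counter` and the toy rows are illustrative.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return a dict mapping each value in a column to its percentage of rows."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100.0 for key, count in counts.items()}

# Toy rows: name, category.
sample_rows = [
    ['App A', 'TOOLS'],
    ['App B', 'GAME'],
    ['App C', 'TOOLS'],
    ['App D', 'FAMILY'],
]
print(freq_table_counter(sample_rows, 1))
```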

The code below makes printing a frequency table simple. It calls the `freq_table()` function we wrote above, so a single call does the whole job: it builds the table, sorts the entries by percentage in descending order, and prints them.

In [19]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key, value in table.items():
        inverted_tuple_format = (value, key)
        table_display.append(inverted_tuple_format)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
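
The inverted tuples above exist only so that `sorted()` compares percentages first. An equivalent ordering can be sketched with the `key` argument of `sorted()`, with one difference worth noting: entries with equal percentages keep their original order here, whereas the tuple version orders them by name in reverse. The helper name `sorted_percentages` and the toy table are assumptions.

```python
def sorted_percentages(table):
    """Return (key, percentage) pairs sorted by percentage, highest first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Toy percentage table.
sample_table = {'GAME': 25.0, 'TOOLS': 50.0, 'FAMILY': 25.0}
for key, percentage in sorted_percentages(sample_table):
    print(key, ':', percentage)
```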

In [23]:
category_index = play_store_header.index('Category')
genre_index = play_store_header.index('Genres')
display_table(free_only_english_apps, genre_index)
print('\n')
display_table(free_only_english_apps, category_index)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S