# Data Analysis for Mobile Apps
This project will analyse data to help our developers understand what type of apps are likely to attract more users. This will help to increase  of main source of revenue through in-app ads.

# Opening Dataset
Open the datasets that we have (from [Kaggle - Google Play Store](https://www.kaggle.com/lava18/google-play-store-apps) and [Kaggle - Apple Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)), consisting of mobile apps data from Google Play Store and Apple Store. 

Here, we created a function `open_file` to read the csv file and store the datasets from Google and Apple into a list of lists respectively. Note that the directory pre-identified and the csv files must be saved in the correct directory.

In [1]:
def open_file(filename):
    opened_file = open(r'C:'+"\\"+'Users'+"\\"+'Andy'+"\\"+'Desktop'+"\\"+'Learning'+"\\"+'Dataquest'+"\\"+'Project_1'+"\\"+
                         filename, encoding='utf8')
    from csv import reader
    read_file = reader(opened_file)
    dataset = list(read_file)
    
    return dataset

In [2]:
dataset_google = open_file('googleplaystore.csv')

In [3]:
dataset_apple = open_file('Applestore.csv')

# Exploring Dataset
Here we define a function so that we can quickly explore the datasets that we have, by printing the selected slice of data in the output. We can also choose to print the header row and the total number of rows and columns. This will allow us to see if the datasets headers are the same, and the amount of data we have in each datasets.

In [4]:
def explore_data(dataset, start, end, table_header = False, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    
    if table_header:
        print('The data headers are: ')
        print(dataset[0])
        print('\n')
        
    for row in dataset_slice:
        print(row)
        print('\n')
    
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

Using the `explore_data function` we have defined, we can quickly see that the datasets headers are different. We also note that the number of columns are different (Google Play Store with 13 and Apple Store with 16), which suggest that the data information that we have for each datasets are different.

In [5]:
explore_data(dataset_google, 1, 3, True, True)

The data headers are: 
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows:  10842
Number of columns:  13


In [6]:
explore_data(dataset_apple, 1, 3, True, True)

The data headers are: 
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows:  7198
Number of columns:  16


# Cleaning Datasets

### Part 1: Removing Duplicate Entries of the Same App

#### From knowledge/information of duplicate entries

From the source of the datasets via Kaggle, we found out that there are erroneous data in the Google Play Store dataset, and it is found at row 10472. We will first check whether the data is indeed an error and thereafter remove it if so. Note that we are accessing index 10473 as the dataset consists of the header as the first row.

In [7]:
explore_data(dataset_google, 10473, 10474, True)

The data headers are: 
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




From above, we notice that there is a missing entry in 'Category' field and resulting in the misalignment of data from the dataset headers (e.g. Rating for this app is 19 which is not possible as it should only be a maximum of 5). We will hence remove this row using the `del` function.

In [8]:
del(dataset_google[10473])

#### Checking through every entry in the dataset using app name to identify duplicates

To make sure that our analysis is accurate, we will need to check through if there is/are duplicate(s) in our dataset. We will first use some commonly known mobile apps for quick check. In this example, we have used Clash of Clans and Facebook.

In [9]:
for app in dataset_google[1:]:
    if app[0] == "Clash of Clans":
        print(app)
    
print('\n')
        
for app in dataset_google[1:]:
    if app[0] == "Facebook":
        print(app)  

['Clash of Clans', 'GAME', '4.6', '44891723', '98M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Strategy', 'July 15, 2018', '10.322.16', '4.1 and up']
['Clash of Clans', 'GAME', '4.6', '44891723', '98M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Strategy', 'July 15, 2018', '10.322.16', '4.1 and up']
['Clash of Clans', 'GAME', '4.6', '44893888', '98M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Strategy', 'July 15, 2018', '10.322.16', '4.1 and up']
['Clash of Clans', 'FAMILY', '4.6', '44881447', '98M', '100,000,000+', 'Free', '0', 'Everyone 10+', 'Strategy', 'July 15, 2018', '10.322.16', '4.1 and up']


['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


Above, we notice that there are duplicate entries for an apps called "Clash of Clans" and "Facebook". 

We will first determine which are the unique apps in the dataset by screening through using the app name.

Here we define a function `unique_duplicate_apps` which take in the `dataset` as well as an `index` to indicate the column in which the app name is found. This `index` is required as earlier we have noticed that the headers for the Google and Apple datasets are different. In fact, the app name for each are found under different column number.

In [10]:
def unique_duplicate_apps(dataset, index):
    unique_apps = []
    duplicate_apps = []
    index = int(index)
    
    for row in dataset[1:]:
        app_name = row[index]
        if app_name in unique_apps:
            duplicate_apps.append(app_name)
        else:
            unique_apps.append(app_name)
    
    print('Number of unique apps: ' + str(len(unique_apps)))
    print('Number of duplicate apps: ' + str(len(duplicate_apps)))
    print('\n')
    print('Examples of duplicate apps: ')
    print(duplicate_apps[:15])

In [11]:
unique_duplicate_apps(dataset_google, 0)

Number of unique apps: 9659
Number of duplicate apps: 1181


Examples of duplicate apps: 
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [12]:
unique_duplicate_apps(dataset_apple, 1)

Number of unique apps: 7195
Number of duplicate apps: 2


Examples of duplicate apps: 
['Mannequin Challenge', 'VR Roller Coaster']


From above, we note that there are **1181** duplicate apps in the **Google dataset** and **2** duplicate apps in the **Apple dataset**.

#### Determining which entry to keep for those with duplicates using total number of reviews

For the dataset with duplicate entries, we will have to determine which to keep and which to delete. We can use the total number of reviews as a measure to do so. The higher the number of reviews could suggest that the entry is the latest. 

We define a function `most_review` to create a dictionary with the key as the app name and the value as the number of reviews. We will cycle through the dataset and only keep the most number of reviews corresponding to each app as the value. Note that this time round, we will need two indices: `app_name_index` and `review_index` to indicate the column in which the app name and number of reviews are found (for Google, we will use `reviews` column; and for Apple, we will use `rating_count_tot` column),

We then do a check to make sure that all unique apps' total number of reviews are captured by comparing the length of our dictionary with the number of unique apps.

In [27]:
def most_review(dataset, app_name_index, review_index):
    
    most_review = {}

    for row in dataset[1:]:
        app_name = row[app_name_index]
        app_review = int(row[review_index])
        if app_name in most_review and app_review > most_review[app_name]:
            most_review[app_name] = app_review
        else:
            most_review[app_name] = app_review
        
    print(len(most_review))
    
    return most_review

In [28]:
most_review_google = most_review(dataset_google,0,3)

9659


In [29]:
most_review_apple = most_review(dataset_apple,1,5)

7195


By defining a function `clean_data`, using the most number of reviews as an indicator, we will cycle through the dataset and only copy the app data that has the most number of reviews into our cleaned dataset. 

Note that repeated entries may have the same total number of reviews, and we should only keep one of them. Hence, we have also created a holding list called `already_added` to check if it is a case of duplicate entry with same total number of reviews and that we have already added into our cleaned dataset.

In [31]:
def clean_data(dataset, most_review, app_name_index, review_index):
    dataset_clean = []
    already_added = []
    
    for row in dataset[1:]:
        app_name  = row[app_name_index]
        app_review = int(row[review_index])
        if (most_review[app_name] == app_review) and (app_name not in already_added):
            dataset_clean.append(row)
            already_added.append(app_name)
            
    print(len(dataset_clean))
    
    return dataset_clean

In [32]:
dataset_google_clean = clean_data(dataset_google, most_review_google, 0, 3)

9659


In [33]:
dataset_apple_clean = clean_data(dataset_apple, most_review_apple, 1, 5)

7195


### Part 2: Removing Non-English Apps

Since we only want to analyse mobile apps that have english name, we will need to check through the mobile app datasets and only keep those with english name.

To do so, we have defined a function `check_english` to screen through character by character based on ASCII value to determine if it is an english character. While it will not be fully deterministic, we will attempt to classify as non-english: 
* If there are more than three non-english characters <u>or</u>
* If the number of non-english characters is the exact length of the app name.

In [40]:
def check_english(word):
    index = 0
    for char in word:
        if ord(char) > 127:
            index += 1
    
    if index > 3 or index == len(word):
        return False
    else:
        return True

Using the function `check_english`, we run our dataset through the function and updated with a new dataset that only consist of english mobile apps.

In [43]:
dataset_google_clean_english = []
dataset_google_clean_non = []

for row in dataset_google_clean:
    app_name = row[0]
    if check_english(app_name):
        dataset_google_clean_english.append(row)
    else:
        dataset_google_clean_non.append(row)
        
print(len(dataset_google_clean_english))

9614


In [45]:
dataset_apple_clean_english = []
dataset_apple_clean_non = []

for row in dataset_apple_clean:
    app_name = row[1]
    if check_english(app_name):
        dataset_apple_clean_english.append(row)
    else:
        dataset_apple_clean_non.append(row)
        
print(len(dataset_apple_clean_english))

6163


In [63]:
print('Total number of non-english app: ' + str(len(dataset_google_clean_non)))
print('\nExamples:')
for row in dataset_google_clean_non[:5]:
    print(row[0])


Total number of non-english app: 45

Examples:
Flame - درب عقلك يوميا
သိင်္ Astrology - Min Thein Kha BayDin
РИА Новости
صور حرف H
L.POINT - 엘포인트 [ 포인트, 멤버십, 적립, 사용, 모바일 카드, 쿠폰, 롯데]


In [66]:
print('Total number of non-english app: ' + str(len(dataset_apple_clean_non)))
print('\nExamples:')
for row in dataset_apple_clean_non[:5]:
    print(row[1])

Total number of non-english app: 1032

Examples:
爱奇艺PPS -《欢乐颂2》电视剧热播
聚力视频HD-人民的名义,跨界歌王全网热播
优酷视频
网易新闻 - 精选好内容，算出你的兴趣
淘宝 - 随时随地，想淘就淘


### Part 3: Removing Paid apps

To have a sense of what are the types of formatting for the price in the datasets, we do a simple sampling by printing the list. We notice that likely free apps are characterised by '0' while a paid app would have the symbol '$' preceding it. We will need to remove this from the string before converting to a float.

In [67]:
price_list = []

for row in dataset_google_clean_english:
    price = row[7]
    price_list.append(price)
    
print(price_list[950:1000])

['0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '$2.99', '0', '0', '0', '0', '0', '0', '0', '$3.99', '0', '0', '0', '0', '$2.99', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0']


Firstly, we will check for the character `$` and remove it. This is to identify apps that may be listed as `$`0.0 and to avoid ommitting relevant data. 

We then check to make sure that the mobile app is free (listed as 0.0) and save it into a new list.

In [69]:
dataset_google_clean_english_free = []

for row in dataset_google_clean_english:
    price = row[7]
    if '$' in row[7]:
        price = price[1:]
        
    price = float(price)
    if price == 0.0:
        dataset_google_clean_english_free.append(row)

print(len(dataset_google_clean_english_free))        

8864


In [70]:
dataset_apple_clean_english_free = []

for row in dataset_apple_clean_english:
    price = row[4]
    if '$' in row[4]:
        price = price[1:]
        
    price = float(price)
    if price == 0.0:
        dataset_apple_clean_english_free.append(row)

print(len(dataset_apple_clean_english_free))  

3204


# Analysing Datasets

We will build two functions we can use to analyse the frequency tables:

* One function called `freq_table` to generate frequency tables that show percentages.
* Another function called `display_table` that we can use to display the percentages in a descending order.

In [71]:
def freq_table(dataset, index):
    
    data_table = {}
    
    for row in dataset:
        value = row[index]
        if value in data_table:
            data_table[value] += 1
        else:
            data_table[value] = 1
            
    percentage_table = {}
    total = 0
    
    for key in data_table:
        total += data_table[key]
        
    for key in data_table:
        percentage_table[key] = (data_table[key] / total) * 100
         
    return percentage_table

In [85]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0], '%')

### Part 1: Analysis by Breakdown of App Category/Genre

We will use the functions we have defined above to determine the most common app genres/categories in the Google Play Store and Apple Store.

Note that while Google Play Store dataset offers category and genres field, we will use the category field as genres field is too granular and specific and app creator can choose more than one genres field to list their app under, as compared to category field in which the app creater can only choose one to list under.

In [78]:
display_table(dataset_google_clean_english_free, 1) #index 1 correspond to Category

FAMILY : 19.223826714801444 %
GAME : 9.510379061371841 %
TOOLS : 8.461191335740072 %
BUSINESS : 4.580324909747293 %
LIFESTYLE : 3.9034296028880866 %
PRODUCTIVITY : 3.892148014440433 %
FINANCE : 3.7003610108303246 %
MEDICAL : 3.5424187725631766 %
SPORTS : 3.4183212996389893 %
PERSONALIZATION : 3.3167870036101084 %
COMMUNICATION : 3.2490974729241873 %
HEALTH_AND_FITNESS : 3.068592057761733 %
PHOTOGRAPHY : 2.944494584837545 %
NEWS_AND_MAGAZINES : 2.7978339350180503 %
SOCIAL : 2.6624548736462095 %
TRAVEL_AND_LOCAL : 2.33528880866426 %
SHOPPING : 2.2450361010830324 %
BOOKS_AND_REFERENCE : 2.1435018050541514 %
DATING : 1.861462093862816 %
VIDEO_PLAYERS : 1.782490974729242 %
MAPS_AND_NAVIGATION : 1.3989169675090252 %
FOOD_AND_DRINK : 1.2409747292418771 %
EDUCATION : 1.128158844765343 %
LIBRARIES_AND_DEMO : 0.9363718411552346 %
AUTO_AND_VEHICLES : 0.9250902527075812 %
ENTERTAINMENT : 0.8799638989169676 %
HOUSE_AND_HOME : 0.8235559566787004 %
WEATHER : 0.8009927797833934 %
EVENTS : 0.7107400722

In [86]:
#display_table(dataset_google_clean_english_free, 9) #index 9 corresponds to Genres

In [82]:
display_table(dataset_apple_clean_english_free, 11) #index 11 corresponds to prime_genre

Games : 58.14606741573034 %
Entertainment : 7.896379525593009 %
Photo & Video : 4.9937578027465666 %
Education : 3.682896379525593 %
Social Networking : 3.245942571785269 %
Shopping : 2.6217228464419478 %
Utilities : 2.528089887640449 %
Sports : 2.153558052434457 %
Music : 2.0599250936329585 %
Health & Fitness : 2.028714107365793 %
Productivity : 1.7478152309612984 %
Lifestyle : 1.5605493133583022 %
News : 1.3420724094881398 %
Travel : 1.2172284644194757 %
Finance : 1.0923845193508115 %
Weather : 0.8739076154806492 %
Food & Drink : 0.8114856429463172 %
Reference : 0.5617977528089888 %
Business : 0.5305867665418227 %
Book : 0.4057428214731586 %
Navigation : 0.18726591760299627 %
Medical : 0.18726591760299627 %
Catalogs : 0.12484394506866417 %


We notice the top 3 saturated categories/genres for each app store:
* Google Play Store: **Family**, **Game**, and **Tools**.
* Apple Store: **Games**, **Entertainment**, and **Photo & Video**.

### Part 2: Analysis by Breakdown of Number of Installs/Reviews

We will also analyse the poularity of apps (have the most users) by calculating the average number of installs for each app by category or by number of ratings left on the app. 

In [114]:
google_cat_table = freq_table(dataset_google_clean_english_free, 1)

old_avg_install = 0

for category in google_cat_table:
    total = 0
    len_category = 0
    for row in dataset_google_clean_english_free:
        category_app = row[1]
        if category_app == category:
            install = row[5]
            install = install.replace(',','')
            install = install.replace('+','')
            install = float(install)
            total += install
            len_category += 1
            
    avg_install = total / len_category
    print(category, ': ', avg_install)
    
    if old_avg_install < avg_install:
        old_avg_install = avg_install
        category_name = category

print('\n')
print('Top category with highest average number of installs:')
print(category_name,': ',old_avg_install)

ART_AND_DESIGN :  1986335.0877192982
AUTO_AND_VEHICLES :  647317.8170731707
BEAUTY :  513151.88679245283
BOOKS_AND_REFERENCE :  8767811.894736841
BUSINESS :  1704192.3399014778
COMICS :  817657.2727272727
COMMUNICATION :  38326063.197916664
DATING :  854028.8303030303
EDUCATION :  1768500.0
ENTERTAINMENT :  9146923.076923076
EVENTS :  253542.22222222222
FINANCE :  1387692.475609756
FOOD_AND_DRINK :  1924897.7363636363
HEALTH_AND_FITNESS :  4167457.3602941176
HOUSE_AND_HOME :  1331540.5616438356
LIBRARIES_AND_DEMO :  638503.734939759
LIFESTYLE :  1437816.2687861272
GAME :  12914435.883748516
FAMILY :  5180161.789906103
MEDICAL :  123064.7898089172
SOCIAL :  23253652.127118643
SHOPPING :  7036877.311557789
PHOTOGRAPHY :  17840110.40229885
SPORTS :  4274688.722772277
TRAVEL_AND_LOCAL :  13984077.710144928
TOOLS :  10801391.298666667
PERSONALIZATION :  5201482.6122448975
PRODUCTIVITY :  16772838.591304347
PARENTING :  542603.6206896552
WEATHER :  5074486.197183099
VIDEO_PLAYERS :  24790074

In [115]:
apple_prime_genre_table = freq_table(dataset_apple_clean_english_free, 11)

old_avg_install = 0

for genre in apple_prime_genre_table:
    total = 0
    len_genre = 0
    for row in dataset_apple_clean_english_free:
        genre_app = row[11]
        if genre_app == genre:
            no_user_rating = float(row[5])
            total += no_user_rating
            len_genre += 1
    
    avg = total / len_genre
    print(genre, ': ', avg)
    
    if old_avg_install < avg:
        old_avg_install = avg
        genre_name = genre

print('\n')
print('Top genre with highest average number of installs:')
print(genre_name,': ',old_avg_install)

Social Networking :  72916.54807692308
Photo & Video :  28441.54375
Games :  22922.805152979065
Music :  57326.530303030304
Reference :  74942.11111111111
Health & Fitness :  23298.015384615384
Weather :  52279.892857142855
Utilities :  18684.456790123455
Travel :  28964.05128205128
Shopping :  26919.690476190477
News :  21248.023255813954
Navigation :  86090.33333333333
Lifestyle :  16815.48
Entertainment :  14085.284584980238
Food & Drink :  33333.92307692308
Sports :  23008.898550724636
Book :  42816.846153846156
Finance :  32367.02857142857
Education :  7003.983050847458
Productivity :  21028.410714285714
Business :  7491.117647058823
Catalogs :  4004.0
Medical :  612.0


Top genre with highest average number of installs:
Navigation :  86090.33333333333
