## Mobile Apps Data Analysis

This open source project demonstrates how to apply the Python programming language to a real data analytics scenario. The purpose of the project is to "help our developers understand what type of apps are likely to attract more users on Google Play and the App Store" (DataQuest.io).

Datasets include...
* [Google Play Store](https://www.kaggle.com/lava18/google-play-store-apps)
* [Apple Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

*The code expects both CSV files to be in the local folder and loads them by relative file path.*

In [190]:
# libraries to load
from csv import reader

In [191]:
## PROVIDED
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### Part 1: Load the Data to Variables

Open, read, list

In [192]:
## Google dataset
# Assumes the CSV files are UTF-8 encoded; `with` closes each file after reading
with open('googleplaystore.csv', encoding='utf8') as open_file:
    google = list(reader(open_file))
google_header = google[0]  # first row holds the column names
google = google[1:]        # remaining rows are the apps

## Apple dataset
with open('AppleStore.csv', encoding='utf8') as open_file:
    apple = list(reader(open_file))
apple_header = apple[0]
apple = apple[1:]
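
The open, read, list steps repeat for each file, so they could be wrapped in a small helper. The following is only a sketch: `load_csv` is a hypothetical name that is not used elsewhere in this notebook, and UTF-8 encoding is an assumption about the files.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file whose first line is a header."""
    # newline='' is the setting the csv module recommends for file objects
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]
```

Under those assumptions, `google_header, google = load_csv('googleplaystore.csv')` would produce the same two variables as the cell above.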

#### Explore the Data

Below, the `explore_data` function defined above prints the header and the first rows of each dataset so we can pick out the columns of interest.

In [193]:
print(google_header)
explore_data(google,0,2, True)
# Columns of Interest: App, Category, Rating, Reviews, Type,
## Price, Genres

print(apple_header)
explore_data(apple,0,2, True)
# Columns of Interest: track_name, price, rating_count_tot,
## user_rating, prime_genre

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '12

### Part 2: Data Cleaning

#### Delete Incorrect Data Rows

As remarked in the Kaggle discussion, row 10472 is incorrect: it has only 12 values instead of 13, so '1.9' ends up in the Category column.

In [194]:
print(google[10472])  # incorrect row
print(google_header)  # header
print(google[0])      # correct row

print(len(google))
del google[10472]  # don't run this more than once
print(len(google))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
10841
10840
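
The row printed above has 12 values against a 13-column header, which suggests a way to find such rows without knowing the index in advance. The sketch below uses toy data; `find_malformed_rows` is a hypothetical helper, and it only catches rows whose column count is wrong, not rows with wrong values in the right number of columns.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose column count differs from the header's."""
    expected = len(header)
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Toy data standing in for the real dataset
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Good App', 'TOOLS', '4.1'],
    ['Shifted App', '1.9'],           # a missing value shifts the columns left
    ['Another App', 'GAME', '4.5'],
]
print(find_malformed_rows(toy_rows, toy_header))  # [1]
```

When deleting several rows by index, removing them from the highest index down avoids shifting the positions of rows that still need to be deleted.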


#### Remove Duplicate App Entries

##### Step 1:

As seen below, the Google (Android) data file contains duplicated rows: 1,181 entries repeat an app name that has already appeared.

For example, 'Instagram' is shown to have 4 entries differing only in the number of reviews.

Notably, the Apple (iOS) dataset has no duplicates; there the check runs on column 0, the app `id`.

In [195]:
duplicate_g = []
unique_g = []

duplicate_a = []
unique_a = []

for app in google:
    name = app[0]
    if name in unique_g:
        duplicate_g.append(name)
    else:
        unique_g.append(name)

print('Number of duplicates apps in the Google/Android dataset is: ', len(duplicate_g))


for app in apple:
    name = app[0]  # column 0 is the 'id' in the Apple dataset (track_name is column 1)
    if name in unique_a:
        duplicate_a.append(name)
    else:
        unique_a.append(name)

print('Number of duplicates apps in the Apple/iOS dataset is: ', len(duplicate_a))
print('\n')
print('Example of duplicates: ',duplicate_g[:5])
print('\n')
print('Example of multiple entries for "Instagram"')
for app in google:
    name = app[0]
    if name == 'Instagram':
        print(app)

Number of duplicates apps in the Google/Android dataset is:  1181
Number of duplicates apps in the Apple/iOS dataset is:  0


Example of duplicates:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


Example of multiple entries for "Instagram"
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with

##### Step 2:

As shown above, the Google dataset needs to be cleaned of duplicate entries. To do this, we create a dictionary that maps each app name to the highest number of reviews recorded for it.

This dictionary is then compared against the entries in the Google dataset to build a clean dataset that keeps only one entry per app: the one with the highest review count.

In [196]:
reviews_max = {}

for apps in google:
    name = apps[0]
    n_reviews = float(apps[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in  reviews_max:
        reviews_max[name] = n_reviews
        
print('Estimated length of reviews max is: ', (len(google) - len(duplicate_g)))  # 1,181 duplicates found above
print('Actual length of reviews max is: ', len(reviews_max))

Estimated length of reviews max is:  9659
Actual length of reviews max is:  9659


In [197]:
google_clean = []
already_added = []

for apps in google:
    name = apps[0]
    n_reviews = float(apps[3])
    if(reviews_max[name] == n_reviews) and (name not in already_added):
        google_clean.append(apps)
        already_added.append(name)

explore_data(google_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
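
The two passes above can also be folded into one. The following is a sketch on toy data, not the notebook's method: `dedupe_by_max_reviews` is a hypothetical helper that stores the best row per name in a dictionary, which also avoids the repeated `name not in already_added` list lookups.

```python
def dedupe_by_max_reviews(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows: name, category, rating, reviews
toy_apps = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App B', 'TOOLS', '4.0', '50'],
    ['App A', 'SOCIAL', '4.5', '250'],
    ['App A', 'SOCIAL', '4.5', '250'],
]
for row in dedupe_by_max_reviews(toy_apps):
    print(row)
```

One difference to be aware of: the result is ordered by the first appearance of each name, whereas `google_clean` is ordered by the position of each kept row, so the two lists can contain the same rows in a different order.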


#### Remove Non-English Apps

Both datasets have non-English entries.

This analysis focuses on the English-speaking market; therefore, we need a way to differentiate between English and non-English apps.


##### Step 1:

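Below is a function that takes a string, converts each character to its Unicode code point with `ord()`, and counts the characters that fall outside the ASCII range (0 to 127), which covers standard English text.

To minimize data loss from names that contain an emoji or a symbol such as ™, only strings with more than three such characters return False.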

In [198]:
def is_english(string):
    count = 0
    for character in string:
        if ord(character) > 127:  # outside ASCII; ord() is never negative
            count += 1
    if count > 3:
        return False
    else:
        return True

print(is_english('Instagram'))
print(is_english('Áà±Â•áËâ∫PPS -„ÄäÊ¨¢‰πêÈ¢Ç2„ÄãÁîµËßÜÂâßÁÉ≠Êí≠'))
print(is_english('Docs To Go‚Ñ¢ Free Office Suite'))
print(is_english('Instachat üòú'))

True
False
True
True
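
The threshold of three is a judgment call, so it can help to make it a parameter and see how the result changes. This is a minimal sketch of the same counting idea; `is_mostly_ascii` and `max_non_ascii` are hypothetical names introduced here for illustration.

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """Return True when at most `max_non_ascii` characters fall outside ASCII (0-127)."""
    non_ascii = sum(1 for character in text if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_mostly_ascii('Instagram'))                      # True
print(is_mostly_ascii('Instachat 😜'))                   # True: one emoji is tolerated
print(is_mostly_ascii('Instachat 😜', max_non_ascii=0))  # False: strict ASCII only
```

This remains a heuristic: an English name with four or more symbols is still rejected, and a non-English name written with three or fewer non-ASCII characters still passes.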


##### Step 2:

We shall now filter through each dataset (Google and Apple) to identify the rows which shall be included in our analysis.

In [199]:
apple_english = []
google_english = []

for apps in apple:
    name = apps[1]  # track_name
    if is_english(name):
        apple_english.append(apps)

for apps in google_clean:
    name = apps[0]  # App
    if is_english(name):
        google_english.append(apps)
        
explore_data(google_english, 0, 3, True)
explore_data(apple_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

#### Isolate Free Apps

For this analysis, we are only looking at apps that are free to download and install.

The final data-cleaning step narrows both datasets to that scope.

In [200]:
apple_free = []
google_free = []

for apps in apple_english:
    price = apps[4]  # price
    if price == '0.0':
        apple_free.append(apps)

for apps in google_english:
    price = apps[7]  # Price
    if price == '0':
        google_free.append(apps)

explore_data(apple_free, 0, 3, True)        
explore_data(google_free, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Vari

In [201]:
ios_final = apple_free
android_final = google_free
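
The free-app filter above compares price strings exactly ('0.0' for Apple, '0' for Google), which only works while each file writes zero in exactly that form. A more tolerant check could parse the price as a number. This is a sketch: `is_free` is a hypothetical helper, and the leading `$` on paid prices is an assumption about how prices may be formatted.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0', or '$0.00' parses to zero."""
    cleaned = price_text.strip().lstrip('$')  # assumed format: optional leading '$'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # unparseable prices are treated as not free

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

With such a helper, both loops could share one condition, for example `if is_free(apps[4])` for Apple and `if is_free(apps[7])` for Google.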

### Part 3: Analysis

In this scenario, we would like to assess which apps will be marketable in both the Android and iOS app stores.

#### Inspect Popular Genres

Below we build two functions for analyzing frequency tables:

* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in a descending order

In [202]:
def freq_table(dataset, index):
    table = {}
    table_percentages = {}
    total = 0
    # Build a table where key = column value and value = count of occurrences
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    # Convert each count to a percentage of all rows
    for key in table:
        percentage = (table[key]/total)*100
        table_percentages[key] = percentage
    
    return table_percentages
            
            
## PROVIDED
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
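
For comparison, the same frequency table can be sketched with `collections.Counter` from the standard library. The names `freq_table_counter` and `sorted_percentages` are hypothetical, and the rows are toy data.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

def sorted_percentages(dataset, index):
    """Return (value, percentage) pairs, largest percentage first."""
    table = freq_table_counter(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Toy rows: name, genre
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for value, percentage in sorted_percentages(toy_rows, 1):
    print(value, ':', percentage)
# Games : 75.0
# Music : 25.0
```

Sorting with a `key` function avoids having to swap each pair into a `(percentage, key)` tuple first. Ties are left in first-seen order here, whereas `display_table` breaks ties by the key in reverse alphabetical order.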

In [203]:
display_table(ios_final, -5) #'prime_genre'

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


In [204]:
display_table(android_final, 9) #Genre

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [205]:
display_table(android_final, 1) #Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

As shown above, free apps in the Android and iOS app stores span many categories.

Some conclusions from the frequency tables shown above:
* The top three 'prime_genres' for the iOS App Store are Games (58%), Entertainment (8%), and Photo & Video (5%)
* The top three 'genres' for the Android App Store are Tools (8.5%), Entertainment (6%), and Education (5.3%)
* The top three 'categories' for the Android App Store are Family (18.9%), Game (9.7%), and Tools (8.5%)

At first glance, most apps either offer an escape (games, entertainment) or make everyday tasks easier (tools, productivity). Because so many apps fall into these categories, a family-friendly game might do well in both the iOS and Android app stores. Notably, no single genre dominates the Android store, whereas Games dominate the iOS store.

A note of caution: a category that contains many free apps does not necessarily have many users. The number of apps in a genre hints at the market, but more analysis is needed to find the right one.

#### Number of User Ratings

The code below computes the average number of user ratings per app within each iOS genre to understand which apps users engage with more.

In [206]:
genre_ios = freq_table(ios_final, -5)

for genre in genre_ios:
    total = 0
    len_genre = 0
    for row in ios_final:
        genre_app = row[-5]
        if genre_app == genre:
            total += float(row[5])
            len_genre += 1
    avg_n_ratings = total / len_genre  # average number of ratings, not the rating score
    print(genre, ':', avg_n_ratings)

Business : 7491.117647058823
Games : 22788.6696905016
News : 21248.023255813954
Food & Drink : 33333.92307692308
Education : 7003.983050847458
Medical : 612.0
Health & Fitness : 23298.015384615384
Finance : 31467.944444444445
Book : 39758.5
Sports : 23008.898550724636
Navigation : 86090.33333333333
Music : 57326.530303030304
Reference : 74942.11111111111
Lifestyle : 16485.764705882353
Shopping : 26919.690476190477
Utilities : 18684.456790123455
Catalogs : 4004.0
Weather : 52279.892857142855
Photo & Video : 28441.54375
Social Networking : 71548.34905660378
Entertainment : 14029.830708661417
Travel : 28243.8
Productivity : 21028.410714285714


Notably, unlike what the initial findings suggest, users do not engage with Games more than with other apps. On average, apps receive the most user ratings in the Navigation, Reference, Social Networking, Music, and Weather genres.
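
The per-genre average above rescans the whole dataset once for every genre. A single pass that accumulates totals and counts gives the same averages. This is a sketch on toy data; `average_by_group` is a hypothetical helper.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Average the numeric column `value_index` within each value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows: genre, number of ratings
toy_rows = [['Games', '10'], ['Games', '30'], ['Music', '5']]
print(average_by_group(toy_rows, 0, 1))  # {'Games': 20.0, 'Music': 5.0}
```

For the iOS data, the equivalent call would be `average_by_group(ios_final, -5, 5)`. Keep in mind that a mean can be pulled up by a handful of very large apps, so a high average does not show that a typical app in the genre is popular.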

#### Number of User Installs

The code below computes the average number of user installs per app within each Android category to understand which apps users download more.

In [209]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_installs = total / len_category
    print(category, ':', avg_installs)

BEAUTY : 513151.88679245283
GAME : 15588015.603248259
PRODUCTIVITY : 16787331.344927534
FAMILY : 3695641.8198090694
BUSINESS : 1712290.1474201474
ART_AND_DESIGN : 1986335.0877192982
EDUCATION : 1833495.145631068
LIFESTYLE : 1437816.2687861272
AUTO_AND_VEHICLES : 647317.8170731707
PARENTING : 542603.6206896552
HOUSE_AND_HOME : 1331540.5616438356
MEDICAL : 120550.61980830671
DATING : 854028.8303030303
COMICS : 817657.2727272727
TRAVEL_AND_LOCAL : 13984077.710144928
FINANCE : 1387692.475609756
LIBRARIES_AND_DEMO : 638503.734939759
MAPS_AND_NAVIGATION : 4056941.7741935486
SOCIAL : 23253652.127118643
ENTERTAINMENT : 11640705.88235294
COMMUNICATION : 38456119.167247385
FOOD_AND_DRINK : 1924897.7363636363
VIDEO_PLAYERS : 24727872.452830188
EVENTS : 253542.22222222222
PERSONALIZATION : 5201482.6122448975
BOOKS_AND_REFERENCE : 8767811.894736841
PHOTOGRAPHY : 17840110.40229885
WEATHER : 5074486.197183099
NEWS_AND_MAGAZINES : 9549178.467741935
TOOLS : 10801391.298666667
SHOPPING : 7036877.3115577

It seems that Android users are most likely to download free apps that let them communicate, play videos, and be social, because these categories have the highest average installs per app:
* Communication:   38,456,119
* Video Players:   24,727,872
* Social:          23,253,652
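
Picking the top categories by eye from an unsorted printout is error-prone, so the averages can be collected in a dictionary and ranked. The sketch below uses hypothetical helper names (`parse_installs`, `rank_by_average`) and copies four of the averages from the output above.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to a float."""
    return float(text.replace(',', '').replace('+', ''))

def rank_by_average(averages, top=3):
    """Return the `top` (category, average) pairs, largest average first."""
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:top]

# Four averages copied from the output above
averages = {
    'PHOTOGRAPHY': 17840110.40229885,
    'SOCIAL': 23253652.127118643,
    'COMMUNICATION': 38456119.167247385,
    'VIDEO_PLAYERS': 24727872.452830188,
}
for category, avg_installs in rank_by_average(averages):
    print(category, ':', round(avg_installs))
# COMMUNICATION : 38456119
# VIDEO_PLAYERS : 24727872
# SOCIAL : 23253652
```

Because the Installs column holds buckets such as '10,000+', the parsed numbers are lower bounds, and the averages are best read as rough comparisons between categories.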