# Analyzing App Data

## Introduction

The goal of this project is to determine which kinds of free apps are likely to attract English-speaking users, by analyzing data from the Apple App Store and the Google Play Store.

## Exploring Data

In [1]:
# Loading Data
from csv import reader

# Read each CSV file into a list of rows (row 0 is the header);
# the with-blocks close the files once they have been read
with open("AppleStore.csv",encoding='utf8') as open_file:
    apple=list(reader(open_file))
with open('googleplaystore.csv',encoding='utf8') as open_file:
    android=list(reader(open_file))

Datasets loaded above:

<a href='https://dq-content.s3.amazonaws.com/350/AppleStore.csv'> AppleStore.csv </a>

<a href='https://dq-content.s3.amazonaws.com/350/googleplaystore.csv'> googleplaystore.csv </a>


In [2]:
# Helper that prints a slice of rows and, optionally, the dataset's dimensions
def explore_data(dataset,start,end,rows_and_columns=False):
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print("Number of rows:", len(dataset))
        print("Number of columns:",len(dataset[0]))

In [3]:
# Checking data for apple
explore_data(apple[1:],0,3,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


In [4]:
#Column Names for apple
apple[0]

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

The relevant columns are:
* 'track_name' (name of the app),
* 'price' (to check if it's free),
* 'rating_count_tot' (total number of ratings, used as a proxy for popularity since 'installs' isn't available),
* 'prime_genre' (to determine the type of app)

In [5]:
# Checking data for android
explore_data(android[1:],0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [7]:
#Column names for android
android[0]

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

The relevant columns are:
* 'App' (name of the app)
* 'Price' and 'Type' (to check if it's free)
* 'Installs' and 'Reviews' (to estimate the number of users who downloaded the app)
* 'Category' and 'Genres' (to determine the type of app)
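
The column positions used later in the notebook can be looked up from the header rows rather than counted by hand. The sketch below copies the two headers shown above; `column_indices` is a hypothetical helper, not part of the original notebook.

```python
# Header rows copied from the outputs above
apple_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

def column_indices(header, names):
    """Map each requested column name to its position in the header (hypothetical helper)."""
    return {name: header.index(name) for name in names}

apple_cols = column_indices(
    apple_header, ['track_name', 'price', 'rating_count_tot', 'prime_genre'])
android_cols = column_indices(
    android_header, ['App', 'Category', 'Installs', 'Price', 'Genres'])
print(apple_cols)
print(android_cols)
```

Looking indices up by name keeps comments and code from drifting apart when a column is referred to by number.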

## Deleting Incorrect Data

An error has been reported in row 10473 of the android dataset (the category name is missing), so we delete that row.

In [8]:
# Delete the faulty row and confirm that the row count drops by one
print(android[10473])
print(len(android))
del android[10473]
print(len(android))

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
10841
10840
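
Deleting by position is not idempotent: if the cell is run a second time, whichever row now sits at that index is removed as well. Below is a minimal sketch of a guard, assuming the faulty row can be recognised by having fewer fields than the header (an assumption based on the missing category). `drop_malformed_rows` and the sample rows are hypothetical.

```python
def drop_malformed_rows(dataset):
    """Return the header plus only those rows with as many fields as the header."""
    header = dataset[0]
    return [header] + [row for row in dataset[1:] if len(row) == len(header)]

# Tiny made-up dataset: the second data row is missing one field
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'SOCIAL', '4.5'],
]
cleaned = drop_malformed_rows(sample)
cleaned_again = drop_malformed_rows(cleaned)  # a second run removes nothing further
print(len(sample), len(cleaned), len(cleaned_again))
```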


## Dealing with duplicate entries

The android dataset contains duplicate entries: some apps appear more than once.

An example is provided below:

In [9]:
for app in android:
    name=app[0]
    if name=='Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


To count the duplicate entries:

In [10]:
duplicate_apps=[]
unique_apps=[]
for app in android:
    name=app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:',len(duplicate_apps))

Number of duplicate apps: 1180
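
Because `name in unique_apps` scans a list, the loop above slows down as the list grows. The same count can be sketched with `collections.Counter`, shown here on made-up names (the sample list is hypothetical):

```python
from collections import Counter

# Made-up app names: 'Instagram' appears three times and 'Slack' twice
names = ['Instagram', 'Slack', 'Instagram', 'Maps', 'Slack', 'Instagram']

name_counts = Counter(names)
# Every occurrence beyond the first counts as a duplicate entry
n_duplicates = sum(count - 1 for count in name_counts.values())
n_unique = len(name_counts)
print('Number of duplicate apps:', n_duplicates)
print('Number of unique apps:', n_unique)
```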


When removing duplicates, we keep only the entry with the most reviews for each app, since a higher review count suggests that the entry was collected more recently than the other repeated entries.

In [11]:
reviews_max={}
for row in android[1:]:
    name=row[0]
    n_reviews=float(row[3])
    if name in reviews_max and reviews_max[name]<n_reviews:
        reviews_max[name]=n_reviews
    elif name not in reviews_max:
        reviews_max[name]=n_reviews
print(len(reviews_max))
# Check that the maximum review count was stored for a known duplicate
print(reviews_max["Instagram"])

9659
66577446.0


Now, we can find all rows with the maximum number of reviews for each unique app and hence create a new dataset with no duplicates

In [12]:
android_clean=[]
already_added=[]
for row in android[1:]:
    name=row[0]
    n_reviews=float(row[3])
    if n_reviews==reviews_max[name] and name not in already_added:
        # The name check is needed because several duplicates may share the largest number of reviews
        android_clean.append(row)
        already_added.append(name)
print(len(android_clean))
# To check that only one value remains
for row in android_clean:
    if row[0]=="Instagram":
        print(row)

9659
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
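
The two passes above can also be folded into one by keeping, for each name, the row with the highest review count seen so far. This is a sketch under the same tie-breaking rule (the first row with the maximum wins); `dedupe_by_max_reviews` and the sample rows are hypothetical.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the largest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows shaped like the android data: name, category, rating, reviews
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Notes', 'TOOLS', '4.0', '120'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
deduped = dedupe_by_max_reviews(sample_rows)
print(deduped)
```

Note that the result is ordered by the first appearance of each name, which may differ from the row order produced by the two-pass version.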


The apple dataset has no duplicates (checked using the 'id' column).

## Filtering Apps to only have English names

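The function below treats a name as non-English if it contains more than 3 characters outside the set of common English characters (code points above 127). Allowing up to 3 such characters accounts for emojis, trademark symbols, etc. in otherwise English names.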

In [13]:
def English_check(string):
    count=0
    for character in string:
        if ord(character)>127:
            count+=1
    if count>3:
        return False
    return True

In [14]:
print(English_check('Instagram'))
print(English_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(English_check('Docs To Go™ Free Office Suite'))
print(English_check('Instachat 😜'))

True
False
True
True
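
To make the threshold concrete, the sketch below counts the characters with a code point above 127 in each of the test names. `count_non_ascii` is a hypothetical helper, not part of the original notebook.

```python
def count_non_ascii(string):
    """Count characters whose Unicode code point falls outside the ASCII range."""
    return sum(1 for character in string if ord(character) > 127)

test_names = ['Instagram',
              '爱奇艺PPS -《欢乐颂2》电视剧热播',
              'Docs To Go™ Free Office Suite',
              'Instachat 😜']
counts = {name: count_non_ascii(name) for name in test_names}
for name, count in counts.items():
    # A name is kept when at most 3 of its characters are non-ASCII
    print(name, count, count <= 3)
```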


Creating the English name datasets for both android and apple

In [15]:
new_android_clean=[]
new_apple=[]
for row in android_clean:
    if English_check(row[0]):
        new_android_clean.append(row)
for row in apple[1:]:
    if English_check(row[1]):
        new_apple.append(row)

In [18]:
print(len(new_android_clean))
print(len(new_apple))

9614
6183


After filtering, the android dataset has 9614 rows and the apple dataset has 6183 rows.

## Isolating free apps in both datasets

In [17]:
android_free=[]
apple_free=[]
for row in new_android_clean:
    # 'Price' column for android (index 7)
    if row[7]=='0':
        android_free.append(row)
for row in new_apple:
    # price column for apple
    if float(row[4])==0:
        apple_free.append(row)

In [21]:
# Could convert using markdown
print(len(android_free))
print(len(apple_free))

8864
3222


After keeping only free apps, the android dataset has 8864 rows and the apple dataset has 3222 rows.
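
The two checks above test the price in different ways (a string comparison for android, a numeric one for apple). Here is a sketch of a single helper that handles both. The `'$4.99'` format for paid android apps is an assumption, since only free rows are shown above, and `is_free` is a hypothetical name.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0

# '0' and '0.0' appear in the rows above; '$4.99' and '1.99' are assumed formats
for text in ['0', '0.0', '$4.99', '1.99']:
    print(repr(text), is_free(text))
```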

## Determining Genre

As mentioned before, to find the genre we will use columns 1 and 9 ('Category' and 'Genres') in android and column 11 ('prime_genre') in apple.

First we create a frequency table to find the percentage of apps of each type.

In [22]:
# Creating frequency table that outputs percentage share of each genre
def freq_table(dataset,index):
    temporary={}
    for row in dataset:
        value=row[index]
        if value in temporary:
            temporary[value]+=1
        else:
            temporary[value]=1
    output=temporary
    for value in output:
        output[value]/=(len(dataset)/100)
    return output

In [23]:
# Displaying entire frequency table
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
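
As a small worked example of the percentage calculation, here is an equivalent sketch built on `collections.Counter` and run on a made-up list of genres. `percentage_table` is a hypothetical stand-in for `freq_table`, not the notebook's function.

```python
from collections import Counter

def percentage_table(dataset, index):
    """Return {value: percentage of rows having that value in column `index`}."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

# Made-up rows: (app name, genre)
rows = [('A', 'Games'), ('B', 'Games'), ('C', 'Music'), ('D', 'Games')]
table = percentage_table(rows, 1)

# Print from the largest share to the smallest, as display_table does
for genre, share in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', share)
```

With three of the four rows in 'Games', that genre's share is 75% and 'Music' takes the remaining 25%.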

### Finding genre for apple

First we find a frequency table for apple

In [25]:
# Frequency table for apple
print('Prime Genre')
display_table(apple_free,11)

Prime Genre
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.6623215394165114
Social Networking : 3.289882060831782
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.048417132216015
Health & Fitness : 2.017380509000621
Productivity : 1.7380509000620734
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310367
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.43451272501551835
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157667


A majority of these apps (about 58%) are games.

Next we compute the average number of ratings per genre, used as a proxy for installs, to determine which genres are more popular.

In [26]:
# Average number of ratings per genre for apple, using 'rating_count_tot' as a proxy for installs
apple_table=freq_table(apple_free,11)
output={}
for genre in apple_table:
    total=0
    len_genre=0
    genre_app=genre
    for row in apple_free:
        if genre_app==row[11]:
            n_ratings=float(row[5])
            total+=n_ratings
            len_genre+=1
    average_genre=total/len_genre
    output[genre_app]=average_genre
output=dict(sorted(output.items(),key=lambda x:x[1],reverse=True))
output

{'Navigation': 86090.33333333333,
 'Reference': 74942.11111111111,
 'Social Networking': 71548.34905660378,
 'Music': 57326.530303030304,
 'Weather': 52279.892857142855,
 'Book': 39758.5,
 'Food & Drink': 33333.92307692308,
 'Finance': 31467.944444444445,
 'Photo & Video': 28441.54375,
 'Travel': 28243.8,
 'Shopping': 26919.690476190477,
 'Health & Fitness': 23298.015384615384,
 'Sports': 23008.898550724636,
 'Games': 22788.6696905016,
 'News': 21248.023255813954,
 'Productivity': 21028.410714285714,
 'Utilities': 18684.456790123455,
 'Lifestyle': 16485.764705882353,
 'Entertainment': 14029.830708661417,
 'Business': 7491.117647058823,
 'Education': 7003.983050847458,
 'Catalogs': 4004.0,
 'Medical': 612.0}

The best genre is probably 'Social Networking', since it ranks in the top 5 both by share of apps (about 3.3%) and by average number of ratings (about 71,500).
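
The nested loops above rescan the dataset once per genre. Below is a single-pass sketch of the same per-genre average, run on made-up rows; `average_by_group` and the sample data are hypothetical.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` within each value of `group_index`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group_index]] += float(row[value_index])
        counts[row[group_index]] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    # Sort from the highest average to the lowest
    return dict(sorted(averages.items(), key=lambda item: item[1], reverse=True))

# Made-up rows: (genre, total rating count)
rows = [('Games', '100'), ('Music', '400'), ('Games', '300'), ('Music', '200')]
averages = average_by_group(rows, 0, 1)
print(averages)
```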

### Finding genre for android

In [28]:
# Frequency table for android
print('Category')
display_table(android_free,1)

Category
FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.700361010830325
MEDICAL : 3.5311371841155235
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.237815884476534
HEALTH_AND_FITNESS : 3.079873646209386
PHOTOGRAPHY : 2.9444945848375452
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768953
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418774
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075813
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_D

Family is the largest category at about 19%, followed by Games at about 10%. The apps are spread more evenly across categories than in the apple dataset.

In [36]:
# Other possible frequency table for android
#print('Genre')
#display_table(android_free,9)

Tools is the most common genre, followed by Entertainment. This table has many more distinct values, so we ignore it and use the 'Category' table instead.

In [34]:
# Average installs for android created using 'Installs'
android_table=freq_table(android_free,1)
output={}
for category in android_table:
    total=0
    len_category=0
    category_app=category
    for row in android_free:
        if category_app==row[1]:
            n_installs=row[5]
            n_installs=n_installs.replace('+',"")
            n_installs=n_installs.replace(',',"")
            n_installs=float(n_installs)
            total+=n_installs
            len_category+=1
    average_category=total/len_category
    output[category_app]=average_category
output=dict(sorted(output.items(),key=lambda x:x[1],reverse=True))
output

{'COMMUNICATION': 38456119.167247385,
 'VIDEO_PLAYERS': 24727872.452830188,
 'SOCIAL': 23253652.127118643,
 'PHOTOGRAPHY': 17840110.40229885,
 'PRODUCTIVITY': 16787331.344927534,
 'GAME': 15588015.603248259,
 'TRAVEL_AND_LOCAL': 13984077.710144928,
 'ENTERTAINMENT': 11640705.88235294,
 'TOOLS': 10801391.298666667,
 'NEWS_AND_MAGAZINES': 9549178.467741935,
 'BOOKS_AND_REFERENCE': 8767811.894736841,
 'SHOPPING': 7036877.311557789,
 'PERSONALIZATION': 5201482.6122448975,
 'WEATHER': 5074486.197183099,
 'HEALTH_AND_FITNESS': 4188821.9853479853,
 'MAPS_AND_NAVIGATION': 4056941.7741935486,
 'FAMILY': 3695641.8198090694,
 'SPORTS': 3638640.1428571427,
 'ART_AND_DESIGN': 1986335.0877192982,
 'FOOD_AND_DRINK': 1924897.7363636363,
 'EDUCATION': 1833495.145631068,
 'BUSINESS': 1712290.1474201474,
 'LIFESTYLE': 1437816.2687861272,
 'FINANCE': 1387692.475609756,
 'HOUSE_AND_HOME': 1331540.5616438356,
 'DATING': 854028.8303030303,
 'COMICS': 817657.2727272727,
 'AUTO_AND_VEHICLES': 647317.8170731707
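
The string clean-up inside the loop can be pulled into a small function. In this sketch `parse_installs` is a hypothetical name, and the inputs are install strings that appear in the rows printed earlier.

```python
def parse_installs(text):
    """Convert an install bracket such as '10,000+' to a float (its lower bound)."""
    return float(text.replace('+', '').replace(',', ''))

for text in ['10,000+', '500,000+', '1,000,000,000+']:
    print(text, '->', parse_installs(text))
```

Since each value reads as "at least this many", averages built from these numbers are probably best treated as rough lower estimates rather than exact install counts.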

In most categories a few apps with very large install counts dominate the average. To get a better picture of which genre to choose, we recompute the averages after removing these huge apps.

The next table only considers apps with at most 100M installs (the code keeps the '100,000,000+' bracket and drops anything larger).
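
To illustrate why the cap matters, the sketch below compares the plain average with the capped one on made-up install counts in which a single very large app dominates. `average_installs` and the numbers are hypothetical.

```python
def average_installs(installs, cap=None):
    """Average install counts, optionally ignoring values above `cap`."""
    kept = [n for n in installs if cap is None or n <= cap]
    return sum(kept) / len(kept)

# Made-up category: three mid-sized apps and one with a billion installs
installs = [1_000_000, 5_000_000, 10_000_000, 1_000_000_000]

plain = average_installs(installs)
capped = average_installs(installs, cap=100_000_000)
print('Plain average: ', plain)
print('Capped average:', capped)
```

If every app in a category exceeded the cap, the division would fail because nothing is kept; the loop in the notebook shares that edge case.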

In [29]:
# Add condition on n_installs
android_table=freq_table(android_free,1)
output={}
for category in android_table:
    total=0
    len_category=0
    category_app=category
    for row in android_free:
        if category_app==row[1]:
            n_installs=row[5]
            n_installs=n_installs.replace('+',"")
            n_installs=n_installs.replace(',',"")
            n_installs=float(n_installs)
            if n_installs<=100000000:
                total+=n_installs
                len_category+=1
    average_category=total/len_category
    output[category_app]=average_category
output=dict(sorted(output.items(),key=lambda x:x[1],reverse=True))
output

{'PHOTOGRAPHY': 14062572.365384616,
 'GAME': 12178377.421236873,
 'ENTERTAINMENT': 11640705.88235294,
 'COMMUNICATION': 9191689.13405797,
 'VIDEO_PLAYERS': 9177767.435897436,
 'PRODUCTIVITY': 8210674.4529411765,
 'SHOPPING': 7036877.311557789,
 'SOCIAL': 6440960.614718615,
 'TOOLS': 6184198.2177419355,
 'PERSONALIZATION': 5201482.6122448975,
 'WEATHER': 5074486.197183099,
 'TRAVEL_AND_LOCAL': 4364410.175609756,
 'MAPS_AND_NAVIGATION': 4056941.7741935486,
 'SPORTS': 3638640.1428571427,
 'BOOKS_AND_REFERENCE': 3523197.1428571427,
 'FAMILY': 3100833.247761194,
 'HEALTH_AND_FITNESS': 2365986.7720588236,
 'ART_AND_DESIGN': 1986335.0877192982,
 'FOOD_AND_DRINK': 1924897.7363636363,
 'EDUCATION': 1833495.145631068,
 'BUSINESS': 1712290.1474201474,
 'NEWS_AND_MAGAZINES': 1502841.8775510204,
 'LIFESTYLE': 1437816.2687861272,
 'FINANCE': 1387692.475609756,
 'HOUSE_AND_HOME': 1331540.5616438356,
 'DATING': 854028.8303030303,
 'COMICS': 817657.2727272727,
 'AUTO_AND_VEHICLES': 647317.8170731707,
 

So it appears that, once the largest apps are excluded, the best category in the android dataset is Photography.

## Conclusions
Combining the results from the apple and android datasets, the most promising type of new app appears to be a social media app involving photography.