# App popularity analysis by genre

Our company wants to determine which types of mobile applications tend to attract the largest user bases. By analyzing App Store (iOS) and Google Play (Android) data, we can compare applications by primary genre, using install counts, or rating counts where installs are unavailable, as a measure of popularity. This should allow us to conclude which types of applications draw the most users, and therefore which genre we should focus on developing.


In [1]:
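from csv import reader

# Load the iOS data and separate the header row from the data rows.
with open('AppleStore.csv', encoding='utf8') as opened_ios:
    ios_data = list(reader(opened_ios))
ios_header = ios_data[0]
ios_data = ios_data[1:]

# Load the Google Play data. The header row is kept as play_data[0];
# later cells skip it with play_data[1:].
with open('googleplaystore.csv', encoding='utf8') as opened_play:
    play_data = list(reader(opened_play))
print(ios_header)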

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [2]:
def explore_dataset(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
explore_dataset(ios_data, 0, 2, True)
print ('\n')
explore_dataset(play_data, 0, 2, True) 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


The first row of each dataset holds the column names. The following columns are relevant to our analysis:

*for iOS:*

- track_name
- price
- rating_count_tot
- user_rating
- prime_genre

*for Android:*

- App
- Category
- Rating
- Installs
- Type

These columns will help us identify trends not only in the size of an app's user base but also in its popularity among users over the app's lifetime. Further detail about each column can be found in the documentation for the [iOS dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) and the [Android dataset](https://www.kaggle.com/lava18/google-play-store-apps?select=googleplaystore.csv).
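As a small aid for the column selection above, the following sketch looks up column positions from a header row instead of hard-coding indices. The helper name `column_indices` is hypothetical (it is not part of the notebook), and the header is copied from the iOS output shown earlier.

```python
# Header row copied from the iOS output above.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


def column_indices(header, names):
    """Map each requested column name to its position in the header.

    Hypothetical helper for illustration only.
    """
    return {name: header.index(name) for name in names}


ios_columns = column_indices(
    ios_header,
    ['track_name', 'price', 'rating_count_tot', 'user_rating', 'prime_genre'],
)
print(ios_columns)
```

Looking indices up this way keeps later cells readable, since `row[ios_columns['prime_genre']]` states its intent more clearly than a bare number.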

In [3]:
print (play_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [4]:
del play_data[10473]  # Remove the malformed 'Life Made WI-Fi Touchscreen Photo Frame' row (its Category value is missing). Run this cell only once.
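The row printed above has 12 fields rather than the 13 named in the header, which is why it is removed. A minimal sketch of how such rows could be located automatically is shown here; `find_malformed_rows` is a hypothetical helper, and the toy dataset contains only the header and two rows copied from the outputs above.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row's length."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Toy data: the header and two rows copied from the outputs above.
toy_play = [
    ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price',
     'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'],
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159',
     '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018',
     '1.0.0', '4.0.3 and up'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free',
     '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'],
]

malformed = find_malformed_rows(toy_play)
for index, row in malformed:
    print(index, len(row), row[0])
```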

### Identifying apps with multiple entries
As we explore our application data, we find that a number of applications appear more than once in our datasets. Here are some examples of duplicate entries found in the Google Play Store dataset:

In [5]:
duplicate_apps = []
unique_apps = []

for row in play_data:
    title = row[0]
    if title in unique_apps:
        duplicate_apps.append(title)
    else:
        unique_apps.append(title)
print ('Number of Duplicate Apps:',len (duplicate_apps))
print ('\n')
print ('Examples of Duplicate Apps:', duplicate_apps[0:15])

Number of Duplicate Apps: 1181


Examples of Duplicate Apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


*We will clean this data systematically by keeping the most recent entry for each duplicated app and removing the rest. Since review counts accumulate over time, we treat the entry with the highest review count in each set of duplicates as the most recent one.*

In [6]:
print('Expected Length:', len(play_data) - len(duplicate_apps))  # this count still includes the header row

Expected Length: 9660


In [7]:
for app in play_data:
    title = app[0]
    if title == 'Slack':
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


In [8]:
reviews_max = {}
for row in play_data[1:]:
    title = row[0]
    n_reviews = float(row[3])
    if (title in reviews_max) and (reviews_max[title] < n_reviews):
        reviews_max[title] = n_reviews
    elif title not in reviews_max:
        reviews_max[title] = n_reviews
print('Actual Length:', len(reviews_max))


Actual Length: 9659


In [9]:
play_clean= []
already_added = []

for row in play_data[1:]:
    title = row[0]
    n_reviews = float(row[3])
    
    if (reviews_max[title] == n_reviews) and (title not in already_added):
        play_clean.append(row)
        already_added.append(title)

explore_dataset(play_clean,0,2,True)




['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13
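
The two-pass approach above (first build `reviews_max`, then filter) can also be written as a single pass. The following is a sketch under assumptions, not the notebook's method: `keep_most_reviewed` is a hypothetical helper, the rows are shortened to four columns, the Slack review counts come from the output above, and the 'Example App' row is made up for illustration.

```python
def keep_most_reviewed(rows, title_index=0, reviews_index=3):
    """Keep one row per title: the first row that has the highest review count."""
    best = {}  # title -> row with the most reviews seen so far
    for row in rows:
        title = row[title_index]
        n_reviews = float(row[reviews_index])
        if title not in best or n_reviews > float(best[title][reviews_index]):
            best[title] = row
    return list(best.values())


# Toy rows: (title, category, rating, reviews).
toy_rows = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Example App', 'TOOLS', '4.0', '10'],  # made-up row for illustration
]

deduplicated = keep_most_reviewed(toy_rows)
for row in deduplicated:
    print(row)
```

One difference to keep in mind: this variant orders the result by the first appearance of each title, whereas the cell above orders it by the position of each kept row.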


### Removing non-English apps from our dataset

Because our company intends to develop applications for a primarily English-speaking audience, our analysis should be based only on apps offered in English. The app store data includes several apps in other languages that offer little insight into our target audience. We will remove non-English apps through the following steps:
    
    - Write a function that analyzes a string to determine whether it contains characters not commonly used in English text.
        - The function iterates over the input string and counts the characters whose Unicode code point is greater than 127.
            - Code points 0-127 make up the ASCII range, which covers the characters commonly used in English text.
        - If a string contains more than three such characters, the function returns "False" and the app is treated as non-English. Allowing up to three lets English titles that include emoji or symbols such as ™ pass the check.

In [10]:
def is_english (string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
        
    if non_ascii > 3:
        return False
    else:
        return True
        
print (is_english('Instagram'))
print (is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print (is_english('Docs To Go™ Free Office Suite'))
print (is_english('Instachat 😜'))


True
False
True
True


This function identifies apps as English or non-English, and can then be applied to the datasets to filter out non-English applications.

In [11]:
play_in_english = []
play_non_english = []

for row in play_clean:
    title = row[0]
    if is_english(title):
        play_in_english.append(row)
    else:
        play_non_english.append(row)

ios_in_english = []
ios_non_english = []

for row in ios_data:
    title = row[1]
    if is_english(title):
        ios_in_english.append(row)
    else:
        ios_non_english.append(row)

explore_dataset(play_in_english, 0, 3, True)
print('\n')
explore_dataset(ios_in_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '

### Identifying Free Applications
Because we aim to develop free-to-download applications, our conclusions should be drawn from data about other free apps. To do this, we must filter out the applications that are not free through the following steps:

1. Identify the column in each dataset that describes the purchase price of the application.
2. Determine which rows contain a value that indicates a free-to-download application.
3. Isolate those free apps in a list separate from the non-free apps.

In [12]:
play_free = []
for row in play_in_english:
    app_type = row[6]  # 'Type' column
    if app_type == 'Free':
        play_free.append(row)

ios_free = []
for row in ios_in_english:
    price = float(row[4])  # 'price' column
    if price == 0.0:
        ios_free.append(row)

print(len(play_free))
print(len(ios_free))

8863
3222


Filtering these datasets by app price leaves us with 8863 Android apps and 3222 iOS apps, which is sufficient for our analysis.

### Determining which genres of apps are most popular

Now that we've identified the free applications in our datasets, our goal is to compare how applications perform by genre.

For a type of application to be a good candidate, it should be popular in both the Google Play Store and App Store markets. We will begin this analysis with a frequency table to identify the most common genres in each market.

In [13]:
def freq_table(dataset, index):
    fq_table = {}
    total = 0
    for row in dataset:
        total += 1
        column = row[index]
        if column in fq_table:
            fq_table[column] += 1
        else:
            fq_table[column] = 1
    percent_table = {}
    for key in fq_table:
        percentage = (fq_table[key]/total) * 100
        percent_table[key] = percentage
    return percent_table


In [14]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
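
As an alternative sketch of the two helpers above, the standard library's `collections.Counter` can count and sort in one step. The name `freq_percentages` is hypothetical, and the toy rows are made up for illustration.

```python
from collections import Counter


def freq_percentages(dataset, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Toy rows with the genre in column 0 (made-up data for illustration).
toy_rows = [['Games'], ['Games'], ['Games'], ['Education']]

table = freq_percentages(toy_rows, 0)
for genre, percentage in table:
    print(genre, ':', percentage)
```

Note that `most_common()` keeps tied values in first-seen order, so genres with equal percentages may appear in a different order than with the tuple sort used in `display_table`.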

In [15]:
print ('iOS Popular Genres:')
display_table(ios_free, 11)
print('\n')
print('Android Popular Genres:')
display_table(play_free, 9)
print ('\n')
print ('Android Popular Categories:')
display_table(play_free, 1)

iOS Popular Genres:
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Android Popular Genres:
Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports :

The frequency tables above provide the following insights:

***iOS App Store** free English apps:*
 
    - Games dominate the App Store, with over half of free English apps (about 58%) falling within this genre.
    
    - There is a stark drop-off in frequency for every genre after "Games"; the next most common genre, "Entertainment," accounts for under 8%.
    
    - Apps that serve to entertain (Games, Social Networking, Sports, Music) appear to be more common than applications that serve a practical purpose (Productivity, Lifestyle, News).
    
        - Education breaks this trend, likely due to the popularity of education-oriented apps among student populations.
        
    - Based on these observations, I would suggest an entertainment app, though a game may face considerable competition and may not perform as well because the market is so crowded.
    
        - This would depend on the user base of apps in these genres: there are far more games than any other kind of app, but that does not necessarily mean that all apps in this genre attract equally large audiences.
        

***Android** free English apps:*

    - The most common genres are "Tools," "Entertainment," and "Education."
    
    - The top categories are "Family" and "Game."
    
    - The Google Play Store uses far more specific labels when sorting apps by genre and category, so an app assigned to one genre in the iOS App Store may be sorted into a similar but different genre in Google Play.
    
    - Apps of the "Tools" genre are the most common, followed by the "Entertainment" genre; at the category level, however, "Family" and "Game" are the most common.
    
    - These tables suggest that the company should develop an app that either entertains its user or serves a functional purpose for its user's business.

**These tables show only how many apps fall within each genre; they do not show how popular those apps actually are with users.**

To learn more about the actual popularity of these genres, we will determine the average number of users or installs per app in each genre.

In the iOS App Store, we can look to the 'rating_count_tot' column as a proxy for number of downloads, and for the Google Play Store, we can look to the 'Installs' column.
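
The 'Installs' values are strings such as '10,000+', so they need to be converted before they can be averaged. The sketch below shows one way to do that; `parse_installs` is a hypothetical helper, and the example values are taken from the outputs above. The result should be read as a lower bound, since the trailing '+' marks an open-ended bucket.

```python
def parse_installs(value):
    """Convert an install bucket such as '5,000,000+' to a float lower bound."""
    return float(value.replace(',', '').replace('+', ''))


examples = ['10,000+', '5,000,000+', '1,000+']  # values seen in the outputs above
parsed = [parse_installs(value) for value in examples]
print(parsed)
```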

The averages will be found through the following steps:
    
    - Isolate the apps of each genre.
    - Find the sum of ratings or installs for all apps in that genre.
    - Divide this sum by the number of applications in that genre.
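
The steps above can also be sketched as a single pass over the data, accumulating a running total and a count per genre. This is an illustrative variant under assumptions: `average_by_group` is a hypothetical helper, and the toy rating counts are made up.

```python
from collections import defaultdict


def average_by_group(dataset, group_index, value_index):
    """Average a numeric column per group in a single pass over the rows."""
    totals = defaultdict(float)  # group -> sum of values
    counts = defaultdict(int)    # group -> number of rows
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Toy rows: (genre, rating count). The counts are made up for illustration.
toy_rows = [['Games', '100'], ['Games', '300'], ['Weather', '50']]

averages = average_by_group(toy_rows, 0, 1)
print(averages)
```

A single pass avoids rescanning the whole dataset once per genre, which matters little at this size but keeps the logic in one place.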
    

In [16]:
genre_table = freq_table(ios_free,11)

In [19]:
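genres_ios = freq_table(ios_free, 11)

for genre in genres_ios:
    total = 0       # sum of rating counts for this genre
    len_genre = 0   # number of apps in this genre
    for app in ios_free:
        genre_app = app[11]  # prime_genre
        if genre_app == genre:
            n_ratings = float(app[5])  # rating_count_tot
            total += n_ratings
            len_genre += 1

    # Average number of ratings per app, used as a proxy for users
    avg_n_ratings = total / len_genre
    genres_ios[genre] = avg_n_ratings
    print(genre, ':', avg_n_ratings)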

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


This table of averages gives us a clearer view of the average number of users per app in each genre, using the number of ratings as a proxy. The highest average is in the "Navigation" genre, followed by "Reference" and then "Social Networking." These averages give us a better idea of the popularity of apps in these genres.
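
Because the table above is printed in dictionary order, the ranking is easier to read once sorted. A short sketch follows, using a handful of averages copied from the output above rather than the full table; the variable names are illustrative.

```python
# A few averages copied from the output above (not the full table).
avg_ratings = {
    'Social Networking': 71548.34905660378,
    'Games': 22788.6696905016,
    'Music': 57326.530303030304,
    'Reference': 74942.11111111111,
    'Navigation': 86090.33333333333,
    'Weather': 52279.892857142855,
}

# Sort genres by average rating count, highest first.
ranked = sorted(avg_ratings.items(), key=lambda item: item[1], reverse=True)
for genre, average in ranked[:3]:
    print(genre, ':', round(average))
```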

Referring back to the frequency table of genres in the App Store, we can see that these three genres together represent only about 4 percent of free English apps in the App Store.

From this observation, it appears that social networking is a popular enough genre in the App Store without being overcrowded, and that it commands a relatively large user base per app.

**It is my recommendation that the company pursue development of a social networking application.**

In [28]:
cat_table = freq_table(play_free,1)

for category in cat_table:
    total = 0
    len_cat = 0
    for row in play_free:
        category_app = row[1]
        if category_app == category:
            n_installs = row[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            total += n_installs
            len_cat += 1
    avg_installs = total/len_cat
    print(category,':',avg_installs)


ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

The "Communication" category dominates the list of average installs per app, and "Social" also has a relatively large average. Comparing this to the average number of ratings per genre in the App Store, we find that the App Store has no "Communication" genre, but that "Social Networking" has one of the highest averages per app.

