# Android & iOS App Analysis

The purpose of this project was to analyze a sample dataset of roughly 10,000 Android apps and 7,000 iOS apps, reviewing genres, prices, ratings, and downloads to identify which types of apps are most popular with consumers.

Goal: Learn what type of app would make sense for the business to create.

In [1]:
from csv import reader

# Turn the iOS dataset into a list of rows, keeping the header separately
# (assumes the CSV files are UTF-8 encoded)
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]

# Turn the Android dataset into a list of rows, keeping the header separately
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android = android[1:]

I then created a function to explore the datasets: it prints a slice of rows and, optionally, the number of rows and columns.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Let's look at the Android dataset:

In [3]:
print(android_header)
print('\n')
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The Google Play dataset has 10,841 rows (apps) and 13 columns. At a quick glance, the columns most useful for this analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.

Now for the iOS dataset:

In [4]:
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


The iOS dataset has 7,197 rows and 16 columns. Of these, the 'prime_genre' column could be useful.

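Community feedback on the dataset reported a row with missing data. I confirmed it at index 10472: the row has only 12 values because its 'Category' entry is missing, which shifts every later value one column to the left. I decided to remove it from the dataset.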

In [5]:
print(android[10472])
del android[10472]
print('\n')
print('Remaining android apps:', len(android))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Remaining android apps: 10840
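
The printed row has only 12 values against a 13-column header, which is what makes it stand out. A minimal sketch of how such rows could be located automatically, assuming the header and rows are plain lists as above. The function name `find_malformed_rows` and the sample data are hypothetical:

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Tiny made-up sample, not the real dataset
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],          # 'Category' value is missing
    ['App C', 'GAME', '3.9'],
]

for index, row in find_malformed_rows(sample_header, sample_rows):
    print(index, row)
```

A length check only catches rows with missing or extra fields; a row with the right number of values but wrong content would need a different check.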


Community discussions also pointed out that the Android dataset contains duplicate apps. Instagram, for example, appears four times:

In [6]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)
        print('\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




Looking at the duplicates, the rows differ mainly in the number of reviews (index 3). When filtering them out, I decided to keep only the entry with the highest number of reviews, since the review count grows over time and that row should be the most recent record.

In [7]:
unique_apps = []
duplicate_apps = []

for app in android:
    name = app[0]
    if name not in unique_apps:
        unique_apps.append(name)
    else:
        duplicate_apps.append(name)
        

print(len(unique_apps))
print(len(duplicate_apps))

9659
1181
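
The same two counts can be obtained without the helper lists. A rough sketch using `collections.Counter` from the standard library; the `names` list is made-up sample data standing in for the first column of `android`:

```python
from collections import Counter

# Made-up sample names; in the notebook these would come from app[0]
names = ['Instagram', 'Slack', 'Instagram', 'Box', 'Instagram', 'Slack']

name_counts = Counter(names)

n_unique = len(name_counts)            # distinct app names
n_duplicates = len(names) - n_unique   # extra rows beyond the first copy of each name

print('Unique apps:', n_unique)
print('Duplicate rows:', n_duplicates)
print('Repeated names:', [name for name, count in name_counts.items() if count > 1])
```

Because `Counter` is dictionary-based, each membership test is constant time, whereas `name not in unique_apps` scans a list on every iteration.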


In [8]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Dictionary length:',len(reviews_max))

Dictionary length: 9659


This matches the length of unique_apps, so the dictionary holds exactly one entry per unique app.

Next I built a list of lists containing only unique apps. Iterating through the Android dataset, I compare each row's review count with the maximum stored for that app name in reviews_max. If they match and the name has not been added yet, the row is appended to the clean list. The already_added check covers duplicates that share the same maximum review count.

In [9]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
explore_data(android_clean, 0, 3, True)
    

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
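
The two passes above (building `reviews_max`, then filtering) could also be folded into one. A sketch under the assumption that the app name sits in column 0 and the review count in column 3, as in the Android data. `dedupe_by_max_reviews` is a hypothetical helper and the rows are made up:

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the most reviews.

    On ties, the first row seen is kept.
    """
    best = {}  # app name -> row with the highest review count so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up sample rows: (name, category, rating, reviews)
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

for row in dedupe_by_max_reviews(sample):
    print(row)
```

One difference from the two-pass version: rows come out in the order each name first appears, not at the position of the row that was kept.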


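Below I wrote a function that iterates through a string and uses ord() to count the characters that fall outside the ASCII range (code point above 127). A name with more than three such characters is treated as non-English, which leaves room for the occasional emoji or symbol such as ™. This was used to filter out non-English apps.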

In [10]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1    
    
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜欢欢欢'))

True
False
True
False
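
The counting logic can be condensed, with the threshold exposed as a parameter. A sketch that should behave the same as `is_english` above; `is_mostly_ascii` and `max_non_ascii` are illustrative names, not part of the original notebook:

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """Return True if text has at most max_non_ascii characters with code point > 127."""
    non_ascii = sum(1 for character in text if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_mostly_ascii('Instagram'))
print(is_mostly_ascii('Docs To Go™ Free Office Suite'))
print(is_mostly_ascii('Instachat 😜欢欢欢'))
```

Either version is a heuristic: it detects non-ASCII text, not language, so a name written in another language using mostly unaccented Latin letters would still pass.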


Then it was time to use the function on both datasets.

In [11]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for row in ios:
    name = row[1]
    if is_english(name):
        ios_english.append(row)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

Once the non-English apps were removed, I needed to filter the datasets to include only free apps, since free apps are the focus of this project. I started by creating lists of only the free apps from both datasets.

In [12]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        

explore_data(android_final, 0, 3, True)
print('\n')
explore_data(ios_final, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 
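
Comparing against the exact strings '0' and '0.0' works for these two files, but it depends on how each store formats its prices. A more forgiving check is sketched below, under the assumption that paid prices may carry a leading '$'. The helper name `is_free` is hypothetical:

```python
def is_free(price):
    """Return True if a price string represents zero, e.g. '0', '0.0' or '$0.00'."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (e.g. an empty or shifted field): treat as not free
        return False


print(is_free('0'))
print(is_free('0.0'))
print(is_free('$4.99'))
```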

The end goal is to identify the kinds of apps that are most successful in both markets. Next I built frequency tables for the 'prime_genre' column of the iOS data and the 'Category' and 'Genres' columns of the Android data.

I wrote a function that takes a dataset and the index of the desired column and produces a frequency table of percentages for that column. Note that the calls below pass ios_english and android_english, so the tables cover paid apps as well as free ones.

In [13]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for app in dataset:
        total += 1
        item = app[index]
        if item in table:
            table[item] += 1
        else:
            table[item] = 1
            
    table_percentage = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentage[key] = percentage
        
    return table_percentage
    
            
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_tuple = (table[key], key)
        table_display.append(key_tuple)
        
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
display_table(ios_english, 11)
print('\n')
display_table(android_english, 1)
print('\n')
display_table(android_english, 9)

Games : 54.860100274947435
Entertainment : 7.261846999838266
Education : 6.6310852337053205
Photo & Video : 5.515122109008572
Utilities : 3.4449296458030085
Productivity : 2.7171276079573023
Health & Fitness : 2.6686074721009216
Music : 2.215752870774705
Social Networking : 2.037845705967977
Sports : 1.6820313763545207
Lifestyle : 1.6011644832605532
Shopping : 1.3747371825974446
Weather : 1.1159631246967492
Travel : 0.9704027171276078
News : 0.9218825812712276
Book : 0.8895358240336406
Reference : 0.8571890667960537
Business : 0.8571890667960537
Finance : 0.7924955523208799
Food & Drink : 0.7116286592269124
Navigation : 0.452854601326217
Medical : 0.3396409509946628
Catalogs : 0.08086689309396733


FAMILY : 19.325982941543582
GAME : 9.819013938007073
TOOLS : 8.61244019138756
BUSINESS : 4.358227584772207
MEDICAL : 4.108591637195756
PERSONALIZATION : 3.900561680882047
PRODUCTIVITY : 3.879758685250676
LIFESTYLE : 3.786145204909507
FINANCE : 3.588516746411483
SPORTS : 3.3804867900977738
CO
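
The same percentage table can be built more compactly with `collections.Counter`. This is a sketch, not a replacement for the functions above; `freq_table_counter` is a hypothetical name and the rows are made up:

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at the given index."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up sample rows: (name, genre)
sample = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Education'],
    ['App D', 'Games'],
]

table = freq_table_counter(sample, 1)
# Sort by percentage, highest first
for genre, percentage in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', percentage)
```

Sorting with an explicit `key` avoids building the intermediate list of `(percentage, key)` tuples used in `display_table`.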

### Prime Genre Analysis - iOS

Analyzing the iOS table, here are my conclusions:
- The most common app genre is **Games**, making up about 55% of apps. The runner-up is **Entertainment** at about 7%.

- Most of the apps appear to be designed for entertainment (games, entertainment, photo and video) rather than for practical purposes (education, utilities, productivity).

- A gaming app, or an app that combines gaming with education or photography, would fit the most common genres. The number of apps in a genre does not by itself show how many users it has, though.

### Category / Genre Analysis - Android

Analyzing the Android tables, here are my conclusions:
- The most common categories are **FAMILY** at about 19%, followed by **GAME** at about 10% and **TOOLS** at about 9%. The most common genres are **Tools** at 8%, followed by **Entertainment** and **Education**, both at 5%.

- Within the Android genres there isn't a single dominant genre; the apps are fairly evenly spread. Among categories, FAMILY holds almost 20%, making it the most common.

- Comparing both datasets, games are by far the most common on iOS, while Android shows a more balanced landscape.

- My recommendation is to gather more information from the Android dataset before making an app recommendation.

I then calculated the average number of user ratings per genre for the iOS data, using the rating count as a proxy for the number of users, and the average number of installs per category for the Android data.
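
The Android 'Installs' column stores values as strings such as '10,000+', so they have to be converted to numbers before averaging. A small sketch of that conversion; the helper name `parse_installs` is hypothetical:

```python
def parse_installs(installs):
    """Convert an install-count string like '1,000,000+' to a float."""
    return float(installs.replace('+', '').replace(',', ''))


print(parse_installs('10,000+'))
print(parse_installs('1,000,000,000+'))
```

Since the '+' means "at least this many", the parsed values are lower bounds, and any averages built from them are approximate.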

In [14]:
prime_table = freq_table(ios_english, 11)

for genre in prime_table:
    counter = 0 # Sum of user ratings
    len_genre = 0 # Number of apps specific to each genre
    
    for app in ios_english:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            counter += n_ratings
            len_genre += 1
    avg_ratings = counter / len_genre
    print(genre, ':', avg_ratings)

Catalogs : 3465.0
Travel : 19030.183333333334
Entertainment : 8862.409799554565
Medical : 648.952380952381
News : 16980.315789473683
Navigation : 19370.821428571428
Health & Fitness : 10802.157575757576
Games : 15586.759433962265
Food & Drink : 19934.386363636364
Lifestyle : 8930.373737373737
Shopping : 26635.011764705883
Productivity : 8508.089285714286
Photo & Video : 14688.715542521993
Book : 10359.2
Reference : 27037.188679245282
Social Networking : 60253.84920634921
Music : 29047.109489051094
Utilities : 7927.525821596244
Education : 2472.278048780488
Sports : 15350.913461538461
Business : 5149.320754716981
Finance : 23353.530612244896
Weather : 23145.246376811596


Social Networking, Music, and Reference have the highest average number of user ratings on iOS, with Shopping close behind. My recommendation would be to build an app that fits one of these genres.
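
The nested loop above rescans the whole dataset once per genre. The averages could also be computed in a single pass. A sketch with made-up rows, where `average_by_group` is a hypothetical helper:

```python
from collections import defaultdict


def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column} computed in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Made-up sample rows: (name, genre, rating count)
sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Music', '50'],
]

averages = average_by_group(sample, 1, 2)
# Print genres from highest to lowest average
for genre, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', avg)
```

Sorting the result also makes the top genres easier to read off than the unordered output of the original loop.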

In [21]:
category_table = freq_table(android_english, 1)

for category in category_table:
    total = 0
    len_category = 0
    for app in android_english:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    average_n_installs = total / len_category
    print(category, ':', average_n_installs)

SOCIAL : 22961790.384937238
ART_AND_DESIGN : 1887285.0
LIBRARIES_AND_DEMO : 630903.6904761905
NEWS_AND_MAGAZINES : 9472807.04
ENTERTAINMENT : 11375402.298850575
TOOLS : 9785955.211352658
GAME : 14256217.600635594
HEALTH_AND_FITNESS : 3972300.388888889
FINANCE : 1319851.4028985507
PHOTOGRAPHY : 16636241.267857144
PARENTING : 525351.8333333334
SPORTS : 3373767.6861538463
FAMILY : 3345018.516684607
DATING : 828971.2176470588
BEAUTY : 513151.88679245283
MEDICAL : 96944.49873417722
SHOPPING : 6966908.880597015
COMICS : 817657.2727272727
LIFESTYLE : 1369954.7774725275
BOOKS_AND_REFERENCE : 7641777.871559633
BUSINESS : 1663758.627684964
FOOD_AND_DRINK : 1891060.2767857143
AUTO_AND_VEHICLES : 632501.3214285715
MAPS_AND_NAVIGATION : 3900634.7286821706
TRAVEL_AND_LOCAL : 13218662.767123288
EVENTS : 249580.640625
EDUCATION : 1782566.0377358492
HOUSE_AND_HOME : 1331540.5616438356
PERSONALIZATION : 4086652.4853333333
WEATHER : 4570892.658227848
PRODUCTIVITY : 15530942.008042896
VIDEO_PLAYERS : 2412