# Apps Analysis

This analysis will help our team understand which types of apps are most likely to attract a broad user base. Knowing which apps attract large numbers of users will enable our team to build widely used apps and, ultimately, generate more ad revenue for the company.

In [1]:
from csv import reader

# read each CSV file into a list of rows (the header is row 0)
opened_apple_file = open('AppleStore.csv')
opened_google_file = open('googleplaystore.csv')
read_apple_file = reader(opened_apple_file)
read_google_file = reader(opened_google_file)
apple_store_data = list(read_apple_file)
google_store_data = list(read_google_file)
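
The loading step above leaves both file handles open and relies on the platform's default text encoding. A minimal sketch of a more defensive loader follows; the helper name `load_csv` and the choice of UTF-8 are assumptions, not part of the original notebook.

```python
from csv import reader

def load_csv(path, encoding='utf-8'):
    """Read a CSV file into a list of rows (header included)."""
    # newline='' is the setting the csv module documentation recommends
    with open(path, encoding=encoding, newline='') as f:
        return list(reader(f))

# Hypothetical usage with the file names used in this notebook:
# apple_store_data = load_csv('AppleStore.csv')
# google_store_data = load_csv('googleplaystore.csv')
```

The `with` block closes the file as soon as the rows have been read, even if parsing raises an error.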

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

In [None]:
explore_data(apple_store_data, 0, 5, rows_and_columns=True)
explore_data(google_store_data, 0, 5, rows_and_columns=True)

## Data set column names

|Apple Store Data   |Google Store Data   |
|-------------------|--------------------|
|id                 |App                 |
|track_name         |Category            |
|size_bytes         |Rating              |
|currency           |Reviews             |
|price              |Size                |
|rating_count_tot   |Installs            |
|rating_count_ver   |Type                |
|user_rating        |Price               |
|user_rating_ver    |Content Rating      |
|ver                |Genres              |
|cont_rating        |Last Updated        |
|prime_genre        |Current Ver         |
|sup_devices.num    |Android Ver         |
|ipadSc_urls.num    |N/A                 |
|lang.num           |N/A                 |
|vpp_lic            |N/A                 |


## Data Cleaning

As a note, one Google Play app was removed from the list because it was missing a 'Category'. The Google Play data set also contains 1181 duplicate entries. The code below displays the number of duplicates and lists the first few duplicated app names.

The duplicates will be removed by keeping only the record that has the highest number of reviews (Reviews column). The assumption is that the more reviews an entry has, the more recently it was added to the data set.

In [None]:
print(google_store_data[10473])
del google_store_data[10473] # only run this once
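
The index 10473 above is hard-coded. One way such a row might be located is to compare each row's length with the header's, assuming a missing field leaves the row shorter than the others. The sketch below runs on toy rows, and `find_malformed_rows` is a hypothetical helper rather than something from the original notebook.

```python
def find_malformed_rows(dataset):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# toy data for illustration only
toy = [
    ['App', 'Category', 'Rating'],
    ['Box', 'BUSINESS', '4.2'],
    ['Broken App', '1.9'],  # missing its Category field
]
print(find_malformed_rows(toy))  # prints [2]
```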

In [8]:
# create a list for unique entries and a list for duplicate entries
# note: the header row is included here, so it counts as one unique entry
unique_apps = []
duplicate_apps = []

for row in google_store_data:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
# print the counts and the first few duplicate apps to confirm
print('Duplicates: ')
print(len(duplicate_apps))
print('Uniques: ')
print(len(unique_apps))
for record in duplicate_apps[:5]:
    print(record)
    print('\n')

Duplicates: 
1181
Uniques: 
9660
Quick PDF Scanner + OCR FREE


Box


Google My Business


ZOOM Cloud Meetings


join.me - Simple Meetings
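
Testing membership in a list makes the loop above slower as the list grows. A sketch of the same counts using `collections.Counter` is shown below; the names are toy data for illustration only.

```python
from collections import Counter

# toy app names, not taken from the data set
names = ['Box', 'Slack', 'Box', 'Zoom', 'Box', 'Zoom']
counts = Counter(names)

# every occurrence beyond the first counts as a duplicate
num_duplicates = sum(n - 1 for n in counts.values())
num_uniques = len(counts)

print('Duplicates:', num_duplicates)  # prints Duplicates: 3
print('Uniques:', num_uniques)        # prints Uniques: 3
```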




In [14]:
# map each unique app name to its highest number of reviews
reviews_max = {}

for i in google_store_data[1:]:
    name = i[0]
    num_reviews = float(i[3])
    
    if name not in reviews_max:
        reviews_max[name] = num_reviews
    else:
        if reviews_max[name] < num_reviews:
            reviews_max[name] = num_reviews
            
# print the number of unique records
print(len(reviews_max))

9659


In [19]:
# create a clean data set for Google Play data
google_data_clean = []
already_added = []

for app in google_store_data[1:]:
    name = app[0]
    num_reviews = float(app[3])
    
    if (reviews_max[name] == num_reviews) and (name not in already_added):
        google_data_clean.append(app)
        already_added.append(name)
        
print(len(google_data_clean))
print(len(already_added))
print(len(google_store_data[1:]))

9659
9659
10840
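
The two passes above (find the maximum review count per name, then keep the first row that matches it) can also be sketched as a single pass. `keep_most_reviewed` is a hypothetical helper and the rows are toy data. Unlike the cells above, it returns the rows in order of each name's first appearance.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # strict '>' keeps the first row seen when review counts tie
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# toy rows: App, Category, Rating, Reviews
toy = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Box', 'BUSINESS', '4.2', '159904'],
    ['Zoom', 'BUSINESS', '4.4', '31614'],
]
deduplicated = keep_most_reviewed(toy)
print(len(deduplicated))  # prints 2
```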


## Additional Data Cleaning

Both data sets contain app names that are not in English. Because our company focuses on free English apps, the code below removes apps whose names are not in English. This leaves 6184 rows in the Apple Store data (a count that includes the header row, since the loop runs over the full list) and 9614 English apps in the Google Play data.

In [20]:
# function to check whether an app name is in English:
# returns False once more than three characters fall outside ASCII (ord > 127)
def is_english(a_string):
    char_counter = 0
    for char in a_string:
        if ord(char) > 127:
            char_counter += 1
            if char_counter > 3:
                return False
    
    return True

In [22]:
# create a subset of data for only English apps
apple_english_apps = []
google_english_apps = []

# note: apple_store_data still contains its header row, which passes the check
for app in apple_store_data:
    title = app[1]
    if is_english(title):
        apple_english_apps.append(app)

for app in google_data_clean:
    app_name = app[0]
    if is_english(app_name):
        google_english_apps.append(app)
        
# print the number of English apps in each data set
print(len(apple_english_apps))
print(len(google_english_apps))

6184
9614


## Remove non-free apps

As mentioned, this analysis focuses on free apps, so the code below keeps only the free apps from each data set.

In [28]:
# remove non-free apps
free_apple_apps = []
free_google_apps = []

# skip the header row, which is still present in apple_english_apps
for app in apple_english_apps[1:]:
    price = float(app[4])
    if price == 0.0:
        free_apple_apps.append(app)

# note: google_english_apps has no header row, so [1:] skips its first app
for app in google_english_apps[1:]:
    price = app[7]
    if price == '0':
        free_google_apps.append(app)
        
# print the number of records for free apps
print(len(free_apple_apps))
print(len(free_google_apps))

3222
8863
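
The two loops above test price in different ways: as a float for the App Store and as the exact string `'0'` for Google Play. A sketch of a single check is shown below. It assumes that paid Google Play prices are stored as strings with a leading dollar sign, such as `'$4.99'`. That format is an assumption, not something shown in this notebook, and `parse_price` and `is_free` are hypothetical helpers.

```python
def parse_price(raw):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    # assumption: the only non-numeric decoration is a leading '$'
    return float(raw.strip().lstrip('$'))

def is_free(raw):
    """True when the parsed price is zero."""
    return parse_price(raw) == 0.0

print(is_free('0'))      # prints True
print(is_free('0.0'))    # prints True
print(is_free('$4.99'))  # prints False
```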


## Analysis

We begin the analysis by building frequency tables that show the most common app categories on both the App Store and Google Play.

In [36]:
def generate_ft(dataset, index):
    dictionary = {}
    for app in dataset:
        column = app[index]
        if column in dictionary:
            dictionary[column] += 1
        else:
            dictionary[column] = 1
            
    return dictionary

def display_table(dataset, index):
    table = generate_ft(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])            

# display_table prints its results and returns None, so nothing is assigned
display_table(free_apple_apps, 11)   # App Store: prime_genre
display_table(free_google_apps, 1)   # Google Play: Category
display_table(free_google_apps, 9)   # Google Play: Genres


Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4
FAMILY : 1676
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 56
COMICS : 55
BEAUTY : 53
Tools : 749
Entertainment : 538
Education : 474
Business : 407
Productivit
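
Raw counts are hard to compare across two stores with different numbers of apps. A sketch of a frequency table expressed in percentages follows; `freq_table_pct` is a hypothetical variant of `generate_ft`, and the rows are toy data.

```python
def freq_table_pct(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

# toy rows: name, genre
toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_pct(toy, 1)

# sort by share, largest first
for genre, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
    print(f'{genre} : {pct:.1f}')
```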

## Further Analysis

The next step is to total the number of user ratings for each genre in the App Store and the number of installs for each category on Google Play.
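
Totals favor genres that simply contain many apps, so a per-app average can be a useful companion figure. The sketch below is an illustration on toy data, not part of the original analysis. `parse_installs` and `average_by_key` are hypothetical helpers, and the `'1,000,000+'` install format is inferred from the characters that the cell that follows strips out.

```python
def parse_installs(raw):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(raw.replace(',', '').replace('+', ''))

def average_by_key(pairs):
    """Average the values of (key, value) pairs per key."""
    totals, counts = {}, {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}

# toy (category, installs) pairs
toy = [('GAME', '1,000,000+'), ('GAME', '500,000+'), ('TOOLS', '10,000+')]
averages = average_by_key((cat, parse_installs(n)) for cat, n in toy)
print(averages)  # prints {'GAME': 750000.0, 'TOOLS': 10000.0}
```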

In [40]:
ft_installs_google = {}
ft_num_ratings_apple = {}

for app in free_apple_apps:
    genre = app[11]
    num_ratings = float(app[5])
    if genre in ft_num_ratings_apple:
        ft_num_ratings_apple[genre] += num_ratings
    else:
        ft_num_ratings_apple[genre] = num_ratings

for app in free_google_apps:
    category = app[1]
    # Installs is a string; strip ',' and '+' before converting to a number
    num_installs = app[5]
    num_installs = num_installs.replace(',', '')
    num_installs = num_installs.replace('+', '')
    num_installs = float(num_installs)
    if category in ft_installs_google:
        ft_installs_google[category] += num_installs
    else:
        ft_installs_google[category] = num_installs
        
for record in ft_num_ratings_apple:
    print(record)
    print(ft_num_ratings_apple[record])
    print('\n')
    
for record in ft_installs_google:
    print(record)
    print(ft_installs_google[record])
    print('\n')

Lifestyle
840774.0


Medical
3672.0


Productivity
1177591.0


Games
42705967.0


Sports
1587614.0


Reference
1348958.0


Utilities
1513441.0


Weather
1463837.0


Education
826470.0


Catalogs
16016.0


Business
127349.0


Navigation
516542.0


Health & Fitness
1514371.0


Social Networking
7584125.0


Finance
1132846.0


Travel
1129752.0


Shopping
2261254.0


Food & Drink
866682.0


Book
556619.0


Photo & Video
4550647.0


News
913665.0


Music
3783551.0


Entertainment
3563577.0


VIDEO_PLAYERS
3931731720.0


PHOTOGRAPHY
4656268815.0


TOOLS
8101043474.0


PERSONALIZATION
1529235888.0


DATING
140914757.0


GAME
13436869450.0


HEALTH_AND_FITNESS
1143548402.0


NEWS_AND_MAGAZINES
2368196260.0


BEAUTY
27197050.0


FINANCE
455163132.0


COMMUNICATION
11036906201.0


TRAVEL_AND_LOCAL
2894704086.0


SHOPPING
1400338585.0


LIFESTYLE
497484429.0


WEATHER
360288520.0


PARENTING
31471010.0


EDUCATION
188850000.0


MEDICAL
37732344.0


FOOD_AND_DRINK
211738751.0


HOUSE_AND_HOME
9720