# Profitable App Profiles for the App Store and Google Play Markets
## This project tells which type of apps are the most profitable to inform companies that make apps. It uses data from the App Store and Google Play, such as number of downloads.

The `explore_data()` method allows us to dive into our data sets. dataset is expected to be a list of lists without a header row.

In [43]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

Open the data sets, print out the first few rows using the `explore_data()` method, and print out each dataset's headers to start identifying which data will be helpful for our analysis. See the documentation for [AppleStore.csv](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) and [google_play_store.csv](https://www.kaggle.com/lava18/google-play-store-apps/home) for further explanations of the data sets.

In [44]:
from csv import reader
apple_store = open(r'datasets/AppleStore.csv', encoding='utf8')
app_read = reader(apple_store)
apple = list(app_read)
google_store = open(r'datasets/googleplaystore.csv', encoding='utf8')
google_read = reader(google_store)
android = list(google_read)

explore_data(apple[1:],0,2,True)
explore_data(android[1:],0,2,True)

print(apple[0])
print(android[0])

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


Number of rows:  7197
Number of columns:  17
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['App', 'Categor

## Data Cleaning
By reading the discussion section for the Google Play data set, we discovered that row 10473 (including header) is missing the category column, causing all other data to shift. I am choosing to delete this row. Reading the Apple Store discussion section, there are no recorded instances of incorrect data.

In [45]:
print(android[10473])
del android[10473]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The Google Play data set contains multiple duplicate apps. We will find these and only keep the row with the most reviews (should be most recent).

In [46]:
dups = []
unique = []
for app in android:
    name = app[0]
    if name in unique:
        dups.append(name)
    else:
        unique.append(name)
print('Number of duplicate apps: ', len(dups))
print('\n')
print('Examples of duplicate apps: ', dups[:5])

Number of duplicate apps:  1181


Examples of duplicate apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


In [47]:
#find most recent duplicate
reviews_max = {}
for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))
#remove all but most recent duplicate
android_clean = []
already_added = []
for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
print(len(android_clean))

9659
9659


Since we are only interested in English apps, we need to remove any apps that have non-English characters in the titles. We can do this by looping through the name's characters and seeing if there are more than 3 characters outside of the common English ascii range. This allows English names with a couple special characters to still be detected as English while still detecting non-English names.

In [48]:
def isEnglish(name):
    num = 0
    for char in name:
        if ord(char)>127:
            num += 1
    if num>3:
        return False
    return True

print(isEnglish('Instagram'))
print(isEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEnglish('Docs To Go™ Free Office Suite'))
print(isEnglish('Instachat 😜'))

True
False
True
True


Now we will filter out non-English apps from both data sets.

In [49]:
android_eng = []
apple_eng = []
for app in android_clean:
    name = app[0]
    if isEnglish(name):
        android_eng.append(app)
for app in apple:
    name = app[0]
    if isEnglish(name):
        apple_eng.append(app)
print(len(android_eng))
print(len(apple_eng))

9614
7198


We are only focused on apps that are free, where revenue comes from ads. Therefore we must remove all apps that are not free. Referring to the column headers, price is index 5 for the apple data set and index 6 for the google play data set. By analyzing these respective columns, we see that free apps in the android data set are referred to as `'Free'` while free apple apps have a price value of `'0'`

In [50]:
print(android_eng[:2])
print(apple_eng[4:6])

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']]
[['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'], ['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']]


In [51]:
android_free = []
apple_free = []
for app in android_eng:
    price = app[6]
    if price == 'Free':
        android_free.append(app)
for app in apple_eng:
    price = app[5]
    if price == '0':
        apple_free.append(app)
print(len(android_free))
print(len(apple_free))

8863
4056


# Data Analysis
Since the company is planning to publish the app to both Google Play and the App Store, we need to find types of apps that are profitable on both. We will build frequency tables to find the most common genres.

In [52]:
#creates a frequency table for any column in any data set
def freq_table(dataset,index):
    freq = {}
    for app in dataset:
        var = app[index]
        if var in freq:
            freq[var] += 1
        else:
            freq[var] = 1
    for val in freq:
        freq[val] = freq[val]/len(dataset)*100
    return freq

#displays table in descending order of %
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
display_table(android_free,1) #Category
display_table(android_free,9) #Genres
display_table(apple_free,12)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

The most common genre on the App Store is Games. The runner-up is Entertainment. It seems that most apps are made for entertainment rather than practical reasons. The popularity of `Games` does not imply that they have the largest number of users.
Google Play seems to have more practical apps, with `Tools` being the most popular Genre. However, games are split into particular types of games, so it is hard to tell whether or not gaming apps are more common.

To find which genres have the most users rather than just most apps, we will use the `Installs` column from the Google Play data and the `rating_count_tot` column from the App Store data, since this data does not include the number of installs.

In [53]:
apple_genre_freq = freq_table(apple_free, 12)
for genre in apple_genre_freq:
    total = 0
    len_genre = 0
    for app in apple_free:
        genre_app = app[12]
        if genre_app == genre:
            n_ratings = float(app[6])
            total += n_ratings
            len_genre += 1
    avg = total/len_genre
    print(genre, ':', '{:,.2f}'.format(avg))

Productivity : 19,053.89
Weather : 47,220.94
Shopping : 18,746.68
Reference : 67,447.90
Finance : 13,522.26
Music : 56,482.03
Utilities : 14,010.10
Travel : 20,216.02
Social Networking : 53,078.20
Sports : 20,128.97
Health & Fitness : 19,952.32
Games : 18,924.69
Food & Drink : 20,179.09
News : 15,892.72
Book : 8,498.33
Photo & Video : 27,249.89
Entertainment : 10,822.96
Business : 6,367.80
Lifestyle : 8,978.31
Education : 6,266.33
Navigation : 25,972.05
Medical : 459.75
Catalogs : 1,779.56


Refernce, Music, and Social Networking apps have the highest average number of reviews on the App Store, so they would be good options for the company.

Now we will analyze the Google Play data. Notice the number of installs are imprecise (100,000+, 1,000,000+, etc). We will process these numbers without the '+' (ex: 100,000; 1,000,000). Although this is not precise, it will give us a general idea of the number of installs for each genre.

In [54]:
android_cat_freq = freq_table(android_free,1)
for cat in android_cat_freq:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == cat:
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            total += n_installs
            len_genre += 1
    avg = total/len_genre
    print(cat, ':', '{:,.2f}'.format(avg))

ART_AND_DESIGN : 1,715,471.21
AUTO_AND_VEHICLES : 358,649.06
BEAUTY : 135,308.71
BOOKS_AND_REFERENCE : 4,260,573.55
BUSINESS : 873,310.89
COMICS : 52,721.16
COMMUNICATION : 9,681,496.67
DATING : 107,980.66
EDUCATION : 134,126.42
ENTERTAINMENT : 662,732.75
EVENTS : 10,265.53
FINANCE : 241,594.02
FOOD_AND_DRINK : 106,187.94
HEALTH_AND_FITNESS : 504,432.47
HOUSE_AND_HOME : 41,539.51
LIBRARIES_AND_DEMO : 21,871.98
LIFESTYLE : 179,662.13
GAME : 3,700,597.48
FAMILY : 1,167,338.05
MEDICAL : 6,715.14
SOCIAL : 937,294.94
SHOPPING : 231,307.99
PHOTOGRAPHY : 737,334.73
SPORTS : 165,542.73
TRAVEL_AND_LOCAL : 424,256.79
TOOLS : 1,069,727.12
PERSONALIZATION : 194,386.16
PRODUCTIVITY : 705,264.16
PARENTING : 3,805.44
WEATHER : 43,194.88
VIDEO_PLAYERS : 462,556.67
NEWS_AND_MAGAZINES : 270,712.88
MAPS_AND_NAVIGATION : 56,702.07


Communication, Game, and Books & Reference apps have the most installs on Google Play, so these would be good apps for the company to develop.

Looking at the most popular genres from both the App Store and Google Play, the company should develop an app that combines references and communication/social interaction. For example, this app could allow users to discuss different sources of information in forums.