# In-App Revenue Analysis for Free Apps

We will analyse free apps available in the Apple App Store and on Google Play. We will focus on free iOS and Android apps that depend on in-app ads for revenue, and we will try to restrict the analysis to English-language apps.

The goal of this project is to analyse which types of apps generate the most revenue and share these insights with our developers.

In [1]:
from csv import reader

# read the iOS app data into a list of rows (the first row is the header);
# the `with` block closes the file once it has been read
with open('AppleStore.csv', encoding='utf8') as file:
    ios = list(reader(file))

# read the Android app data into a list of rows (the first row is the header)
with open('googleplaystore.csv', encoding='utf8') as file:
    android = list(reader(file))

In [2]:
# function to explore dataset and optionally print dataset shape
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
              
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(ios,0,2, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


Number of rows: 7198
Number of columns: 17


### data cleaning 1 - removing erroneous records

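Reviewing the discussions on this dataset reveals an error in row 10473: it is missing its category value, so every later field is shifted one column to the left (in the output below, '1.9' appears where the category should be). We could either fill in the missing value, if we knew it, or remove the row from the analysis. We will remove it below.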
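
A row like this can also be found programmatically by comparing each row's field count with the header's. The following is a minimal sketch on toy data; `find_malformed_rows` and the three-column sample are hypothetical and only illustrate the check.

```python
# Toy rows standing in for the Android data (hypothetical, three columns only);
# the last row lacks a category value, so it has one field too few.
header = ['App', 'Category', 'Rating']
rows = [
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4'],
    ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]


def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


malformed = find_malformed_rows(header, rows)
print(malformed)
```

A length check only catches rows with missing or extra fields; a row with the right number of fields but a wrong value would need a separate check.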

In [4]:
explore_data(android, 10471, 10474, True)

['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10842
Number of columns: 13


In [5]:
android[:1]

[['App',
  'Category',
  'Rating',
  'Reviews',
  'Size',
  'Installs',
  'Type',
  'Price',
  'Content Rating',
  'Genres',
  'Last Updated',
  'Current Ver',
  'Android Ver']]

In [6]:
# the app on row 10473, 'Life Made WI-Fi Touchscreen Photo Frame', is missing
# its category value, so we remove it for this analysis
del android[10473]

### data cleaning 2 - removing duplicates

Exploring the dataset further reveals that the Android dataset contains duplicate entries for some apps. Below we count the number of unique app names and the number of duplicate entries.

When we review a few of the duplicates we see that they are not exact duplicates: the total number of reviews differs between records. We will keep the record with the most reviews, since it should be the most current one.
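
As a compact illustration of the "keep the record with the most reviews" rule, here is a hedged single-pass sketch. The helper `keep_most_reviewed` and the two-column sample rows are hypothetical; the cells below reach the same result in two passes.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=1):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # Strict '>' means the first row wins when review counts are tied.
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Hypothetical sample: 'App A' appears three times with different review counts.
sample = [
    ['App A', '100'],
    ['App B', '50'],
    ['App A', '250'],
    ['App A', '250'],
]
deduped = keep_most_reviewed(sample)
print(deduped)
```

Storing the whole row in the dictionary avoids a second loop and the separate list of names that were already added.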

In [7]:
duplicate_apps = []
unique_apps = []

for app in android[1:]:  # skip the header row so it is not counted as an app
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Number of unique apps:', len(unique_apps))

Number of duplicate apps: 1181


Number of unique apps: 9659


In [8]:
# build a dictionary that maps each unique app name to the highest
# number of reviews recorded for it
reviews_max = {}

for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    if name not in reviews_max:
        reviews_max[name] = n_reviews

In [9]:
# checking to see if we get the expected number of records
# (subtract 1 for the header row, which is still part of `android`)
print('Expected number of unique records:', len(android) - 1 - 1181)

print('Number of records after removing duplicates:', len(reviews_max))

Expected number of unique records: 9659
Number of records after removing duplicates: 9659


In [10]:
# removing duplicate rows using the new dictionary
# we will create a clean dataset by comparing the original list to the new dictionary 
# and only selecting the app record with the maximum number of reviews
android_clean = []
already_added = []

for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)  

In [11]:
len(android_clean)

9659

In [12]:
explore_data(android_clean, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9659
Number of columns: 13


### data cleaning 3 - removing non-English Apps 

The dataset does not contain a feature for app language, so we need to identify a different way to remove non-English apps. Some app names contain characters not commonly used in English, so we will use those characters to flag apps as non-English and remove them.

The characters commonly used in English text all fall in the range 0 to 127 of ASCII (American Standard Code for Information Interchange).
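
To make the 0 to 127 range concrete, the sketch below counts the characters in a string whose code point falls outside it. The helper name `count_non_ascii` is hypothetical; it only illustrates the idea behind the check.

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range 0-127."""
    return sum(1 for char in text if ord(char) > 127)


print(ord('a'))                        # code point of a plain English letter
print(count_non_ascii('Instagram'))    # every character is ASCII
print(count_non_ascii('Docs To Go™'))  # only the trademark sign is non-ASCII
print(count_non_ascii('爱奇艺PPS'))     # three Chinese characters, three ASCII letters
```

Counting, rather than rejecting on the first non-ASCII character, lets names with an occasional symbol or emoji still pass as English.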

In [13]:
string = 'zda'

In [14]:
ord('z')

122

In [15]:
# function to check whether a name contains characters outside the ASCII range
# we treat a name as non-English if it has more than 3 such characters; not perfect
# the `ord()` function returns the code point (number) of the character

def is_eng(string):
    non_eng_char = 0
    
    for char in string:
        if ord(char) > 127:
            non_eng_char += 1
    
    if non_eng_char > 3:
        return False
    else:
        return True

In [16]:
is_eng('Instagram')

True

In [17]:
is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [18]:
is_eng('Instachat 😜')

True

In [19]:
is_eng('Docs To Go™ Free Office Suite')

True

In [20]:
print(ord('😜'))

128540


In [21]:
# cleaning both datasets (ios and android_clean)
ios_eng = []
android_eng = []

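# note: with the header shown earlier, index 1 of an iOS row is 'id' and
# 'track_name' is index 2, so this check runs on the numeric id and keeps every row
for app in ios:
    name = app[1]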
    if is_eng(name):
        ios_eng.append(app) 
        
for app in android_clean:
    name = app[0]
    if is_eng(name):
        android_eng.append(app)

In [22]:
print(len(ios))
print(len(ios_eng))

print(len(android_clean))
print(len(android_eng))

7198
7198
9659
9614


### data cleaning 4 - removing non-free apps

Our analysis concerns in-app ads in free apps, so we need to isolate the free apps.
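
One way to express the filter is a small predicate on the price text combined with a list comprehension. This is only a sketch: `is_free`, the sample rows, and the handling of a leading '$' are assumptions made for illustration, not taken from the datasets.

```python
def is_free(price_text):
    """Treat a price string as free when it parses to zero.

    A leading '$' is stripped first; that prefix is an assumption
    about how some price strings may be formatted.
    """
    return float(price_text.lstrip('$')) == 0.0


# Hypothetical (name, price) rows for illustration.
rows = [
    ['PAC-MAN Premium', '3.99'],
    ['PayPal - Send and request money safely', '0'],
    ['Hypothetical Paid App', '$4.99'],
]
free_rows = [row for row in rows if is_free(row[1])]
print(free_rows)
```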

In [27]:
# price is at index 5 in ios
# the 'Type' column (e.g. 'Free') is at index 6 in android; 'Price' is at index 7
#ios_eng[:3]
android_eng[:2]

[['Photo Editor & Candy Camera & Grid & ScrapBook',
  'ART_AND_DESIGN',
  '4.1',
  '159',
  '19M',
  '10,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'January 7, 2018',
  '1.0.0',
  '4.0.3 and up'],
 ['U Launcher Lite – FREE Live Cool Themes, Hide Apps',
  'ART_AND_DESIGN',
  '4.7',
  '87510',
  '8.7M',
  '5,000,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'August 1, 2018',
  '1.2.4',
  '4.0.3 and up']]

In [45]:
ios_eng[1:2]

[['1',
  '281656475',
  'PAC-MAN Premium',
  '100788224',
  'USD',
  '3.99',
  '21292',
  '26',
  '4',
  '4.5',
  '6.3.5',
  '4+',
  'Games',
  '38',
  '5',
  '10',
  '1']]

In [33]:
android_eng_free = []

for app in android_eng:
    app_type = app[6]  # the 'Type' column holds 'Free' for free apps
    
    if app_type == "Free":
        android_eng_free.append(app)

In [67]:
android_eng_free[400:402]

[['Heart mill',
  'DATING',
  '3.3',
  '4631',
  '45M',
  '100,000+',
  'Free',
  '0',
  'Mature 17+',
  'Dating',
  'August 4, 2018',
  '5.2.14',
  '5.0 and up'],
 ['Mutual - LDS Dating',
  'DATING',
  '3.7',
  '1439',
  '38M',
  '50,000+',
  'Free',
  '0',
  'Mature 17+',
  'Dating',
  'July 30, 2018',
  '1.1.46',
  '4.2 and up']]

In [35]:
#checking length to see how many android apps are "free"
len(android_eng_free)

8863

In [57]:
ios_eng_free = []

for app in ios_eng[1:]:
    price = float(app[5])
    
    if price == 0:
        ios_eng_free.append(app)
    

In [98]:
ios_eng_free[4:5]

[['7',
  '283646709',
  'PayPal - Send and request money safely',
  '227795968',
  'USD',
  '0',
  '119487',
  '879',
  '4',
  '4.5',
  '6.12.0',
  '4+',
  'Finance',
  '37',
  '0',
  '19',
  '1']]

In [82]:
print("android_eng_free:", len(android_eng_free))
print("ios_eng_free:", len(ios_eng_free))

android_eng_free: 8863
ios_eng_free: 4056


### The next step is to identify potential Apps that may perform well in both Android and iOS

We will first evaluate which genres of free apps are most common on both platforms. We will do this by building frequency tables to get a sense of the apps on each platform.
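
For reference, a frequency table of this kind can also be sketched with `collections.Counter` from the standard library. The function name `percentage_table` and the sample rows are hypothetical; the cells below build the same kind of table by hand.

```python
from collections import Counter


def percentage_table(rows, index):
    """Return (value, percentage) pairs for one column, most common first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(key, count / total * 100) for key, count in counts.most_common()]


# Hypothetical (name, genre) rows for illustration.
sample = [
    ['a', 'Games'],
    ['b', 'Games'],
    ['c', 'Weather'],
    ['d', 'Games'],
]
table = percentage_table(sample, 1)
for genre, percent in table:
    print(genre, ':', percent)
```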

In [68]:
# the genre column (prime_genre) is at index 12 in the iOS dataset
# the category column (Category) is at index 1 in the Android dataset; Genres is at index 9

In [109]:

def freq_table(dataset, index):
    table = {}
    count = 0
    
    for row in dataset:
        count += 1
        key = row[index]
        
        if key in table:
            table[key] += 1
        else:
            table[key] = 1
            
    percentages = {}
    for key in table:
        percent = (table[key] / count) * 100
        percentages[key] = percent
        
    return percentages
    
    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [93]:
# frequency percentage table for iOS genres
# (display_table only prints and returns None, so there is nothing to assign)
display_table(ios_eng_free, 12)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


In [92]:
# frequency percentage table for Android categories
# (display_table only prints and returns None, so there is nothing to assign)
display_table(android_eng_free, 1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

In [97]:
# frequency percentage table for Android genres
#android_genre = display_table(android_eng_free, 9)

### We will now analyze which genres are most popular

We will use the number of installs from the Android dataset and the total count of ratings for the iOS apps (since we don't have install counts for iOS).

The rating count (rating_count_tot) is at index 6 in the iOS dataset; the genre (prime_genre) is at index 12.
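
The per-genre average can be computed in a single pass by accumulating a running total and a count for each group. The sketch below assumes nothing beyond the standard library; `average_by_group` and the sample rows are hypothetical names and data.

```python
from collections import defaultdict


def average_by_group(rows, group_idx, value_idx):
    """Average the numeric column value_idx for each distinct value in group_idx."""
    totals = defaultdict(float)  # group -> running sum of values
    counts = defaultdict(int)    # group -> number of rows seen
    for row in rows:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical (genre, rating count) rows for illustration.
sample = [
    ['Games', '100'],
    ['Games', '300'],
    ['Weather', '50'],
]
averages = average_by_group(sample, 0, 1)
print(averages)
```

A plain mean like this is sensitive to a few very large apps within a genre, which is worth keeping in mind when reading the averages.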

In [140]:
ios_genres = freq_table(ios_eng_free, 12)

for genre in ios_genres:
    total = 0 
    len_genre = 0
    for app in ios_eng_free:
        genre_app = app[12]
        if genre_app == genre:
            rating_tot = float(app[6])
            total += rating_tot
            len_genre += 1   
    avg_rat = total / len_genre
    
    print(genre, ":", avg_rat)
    

Productivity : 19053.887096774193
Weather : 47220.93548387097
Shopping : 18746.677685950413
Reference : 67447.9
Finance : 13522.261904761905
Music : 56482.02985074627
Utilities : 14010.100917431193
Travel : 20216.01785714286
Social Networking : 53078.195804195806
Sports : 20128.974683544304
Health & Fitness : 19952.315789473683
Games : 18924.68896765618
Food & Drink : 20179.093023255813
News : 15892.724137931034
Book : 8498.333333333334
Photo & Video : 27249.892215568863
Entertainment : 10822.961077844311
Business : 6367.8
Lifestyle : 8978.308510638299
Education : 6266.333333333333
Navigation : 25972.05
Medical : 459.75
Catalogs : 1779.5555555555557


### iOS App results

Based on the average rating counts above, it would be beneficial to focus on a Weather app: the genre averages roughly 47,000 ratings per app, and such an app would probably be comparatively easy to develop and deploy.

In [157]:
# the Android install counts are not exact numbers: they are strings with commas and a trailing '+'
# we will strip the commas and '+' and convert each count to a float to do calculations
display_table(android_eng_free, 5)

1,000,000+ : 15.728308699086089
100,000+ : 11.55365000564143
10,000,000+ : 10.549475346947986
10,000+ : 10.199706645605326
1,000+ : 8.394448832223853
100+ : 6.916393997517771
5,000,000+ : 6.826131106848697
500,000+ : 5.562450637481666
50,000+ : 4.772650344127271
5,000+ : 4.513144533453684
10+ : 3.542818458761142
500+ : 3.2494640640866526
50,000,000+ : 2.3017037120613786
100,000,000+ : 2.1324607920568655
50+ : 1.9180864267178157
5+ : 0.7898002933543946
1+ : 0.5077287600135394
500,000,000+ : 0.270788672007221
1,000,000,000+ : 0.2256572266726842
0+ : 0.045131445334536835


In [158]:
android_genres = freq_table(android_eng_free, 1)

for category in android_genres:
    total = 0
    len_category = 0
    
    for app in android_eng_free:
        category_app = app[1]
        if category_app == category:
            installs = app[5]
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            installs = float(installs)
            
            total += installs
            len_category += 1
            
    avg_cat = total / len_category
    
    print(category, ':', avg_cat)


ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3697848.1731343283
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

### Android results

Based on the average install counts above, the recommendation would be to also focus on a Weather app. The goal is to develop an app that will be successful on both iOS and Android, and based on this analysis Weather apps may have the highest ROI.