# Profitable analysis for Android and iOS mobile apps


Goal: The project aims to help developers understand which types of apps are likely to attract more users, based on data-driven analysis.

Method: The analysis targets only apps that are *free* to download and install and aimed at *English-speaking* users; the main source of revenue is in-app ads.

Data set: Mobile apps from App Store / iOS and Google Play / Android.

Key parameter: The number of app users.

## Data

The sample data set of ~10,000 Android apps from Google Play is available [here](https://www.kaggle.com/lava18/google-play-store-apps).

The sample data set of ~7,000 iOS apps from the App Store is available [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

## Open the data

In [1]:
from csv import reader

# Load the App Store data; the first row is the header
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios_ds = list(reader(opened_file))
ios_head = ios_ds[0]
ios = ios_ds[1:]
ios_ln = len(ios)
print('Length of ios data without the header:', ios_ln)

# Load the Google Play data the same way
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android_ds = list(reader(opened_file))
android_head = android_ds[0]
android = android_ds[1:]
android_ln = len(android)
print('Length of android data without the header:', android_ln)

Length of ios data without the header: 7197
Length of android data without the header: 10841


To get an insight into the data sets, use the `explore_data()` function:

In [2]:
def explore_data(dataset,start,end,rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
    return len(dataset[0]),len(dataset)

To explore the header of a data set, use the `head` function:

In [3]:
def head(dataset):
    # Print each column name together with its index
    for i, column in enumerate(dataset):
        print(str(i) + ' - ' + str(column))

Details about App Store / iOS data set with examples:

In [4]:
print('Apple Store header')
head(ios_head)
print('\n')
print('Apple Store insides')

ios_n_column, ios_n_row = explore_data(ios,0,2,True)

Apple Store header
0 - id
1 - track_name
2 - size_bytes
3 - currency
4 - price
5 - rating_count_tot
6 - rating_count_ver
7 - user_rating
8 - user_rating_ver
9 - ver
10 - cont_rating
11 - prime_genre
12 - sup_devices.num
13 - ipadSc_urls.num
14 - lang.num
15 - vpp_lic


Apple Store insides
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


Details about Google Play / Android data set with examples:

In [5]:
print('Android header')
head(android_head)
print('\n')
print('Android insides')

android_n_column, android_n_row = explore_data(android,0,2,True)

Android header
0 - App
1 - Category
2 - Rating
3 - Reviews
4 - Size
5 - Installs
6 - Type
7 - Price
8 - Content Rating
9 - Genres
10 - Last Updated
11 - Current Ver
12 - Android Ver


Android insides
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13
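
Addressing columns by position (`app[5]`, `app[11]`) is easy to get wrong when two data sets use different layouts. A minimal sketch, assuming the same CSV layout as above, of reading rows as dictionaries with `csv.DictReader` so that columns can be accessed by header name. The in-memory sample is made up for illustration and only stands in for `googleplaystore.csv`:

```python
import csv
import io

# Made-up two-row sample standing in for googleplaystore.csv (illustration only)
sample_csv = io.StringIO(
    'App,Category,Installs\n'
    'Demo Paint,ART_AND_DESIGN,"10,000+"\n'
    'Demo Notes,PRODUCTIVITY,"500,000+"\n'
)

# DictReader uses the first row as the keys for every following row
rows = list(csv.DictReader(sample_csv))

for row in rows:
    print(row['App'], '-', row['Category'], '-', row['Installs'])
```

The rest of this notebook keeps the list-of-lists representation, so this is only an alternative way to explore the files.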


## Data cleaning:

1) detect and remove malformed rows

2) detect and remove rows with an invalid rating

3) detect and remove duplicates

4) detect and remove data unsuitable for the task, including non-free and non-English apps

To find the index of rows with missing or extra fields, use the `catch_defect` function:

In [6]:
def catch_defect(data_set, data_length):
    i = 0
    # enumerate gives the true position even if identical rows exist
    for index, row in enumerate(data_set):
        if len(row) != data_length:
            print('Defected row is:', index)
            print(row)
            i += 1
    if i == 0:
        print('No defects found')
    else:
        print(i, 'defected rows found')

Defected rows in App Store / iOS data set:

In [7]:
catch_defect(ios, ios_n_column)

No defects found


Defected rows in Google Play / Android data set:

In [8]:
catch_defect(android, android_n_column)

Defected row is: 10472
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
1 defected rows found


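Delete the defective row from the Google Play / Android data set (its `Category` field is missing, so the remaining values are shifted one column to the left):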

In [9]:
del android[10472]

In [10]:
catch_defect(android, android_n_column)

No defects found
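
Deleting by a hard-coded index is not safe to re-run: executing `del android[10472]` a second time would silently remove a valid row. A possible alternative, sketched here with a hypothetical helper `drop_malformed` (not part of the original notebook), is to filter by row length, which can be repeated without side effects:

```python
def drop_malformed(rows, n_columns):
    """Return only the rows that have exactly n_columns fields."""
    return [row for row in rows if len(row) == n_columns]

# Made-up sample: the second row is missing one field
sample = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '4.5'],
    ['App C', 'GAME', '3.9'],
]

cleaned = drop_malformed(sample, 3)
print(len(sample), '->', len(cleaned))
```

Applied to the notebook's data, the call would look like `android = drop_malformed(android, android_n_column)`.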


To check whether a rating falls outside the valid range of 0 to 5, use the `catch_rating` function:

In [11]:
def catch_rating(data_set, ios_or_android):
    # Columns that hold ratings: iOS has two (all versions, current version), Android has one
    if ios_or_android == 'ios':
        rating_columns = [7, 8]
    elif ios_or_android == 'android':
        rating_columns = [2]
    else:
        return  # unknown platform: nothing to check

    n_wrong = 0
    for column in rating_columns:
        for index, app in enumerate(data_set):
            rating = float(app[column])
            # Valid ratings lie between 0 and 5; a 'NaN' value compares as False and is not flagged
            if rating < 0 or rating > 5:
                print('Row with wrong rating is:', index)
                print(app[column])
                n_wrong += 1

    if n_wrong == 0:
        print('No defects found')
    else:
        print(n_wrong, 'rows with wrong rating found')

Rows with an invalid rating in the App Store / iOS and Google Play / Android data sets:

In [12]:
catch_rating(ios,'ios')
print('\n')
catch_rating(android,'android')

No defects found


No defects found
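
One caveat when range-checking ratings: some Google Play rows may carry a missing rating stored as the string `'NaN'` (an assumption about the raw data that is worth verifying). `float('NaN')` converts without error, and every ordered comparison with NaN is false, so such rows pass a range test unnoticed. A small sketch that reports them separately, using a hypothetical helper `classify_rating`:

```python
import math

def classify_rating(value, low=0.0, high=5.0):
    """Label a rating string as 'ok', 'missing' (NaN) or 'out_of_range'."""
    rating = float(value)
    if math.isnan(rating):
        return 'missing'
    if rating < low or rating > high:
        return 'out_of_range'
    return 'ok'

# Made-up values: a normal rating, a missing one, and one outside 0..5
for value in ['4.1', 'NaN', '19']:
    print(value, '->', classify_rating(value))
```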


There are some duplicates in the data that should be detected and removed. The key parameter for detecting a duplicate is the app *name*. The criterion for choosing which of the duplicated entries to keep is *the number of reviews*: the entry with the maximum number is assumed to hold the most recent information.

To detect the duplicates in a data set and count them, use the `catch_dublicates` function:

In [13]:
def catch_dublicates(data_set):
    dublicate_apps = []
    unique_apps = []
    for app in data_set:
        name = app[0]
        if name in unique_apps:
            dublicate_apps.append(name)
        else:
            unique_apps.append(name)
    print('Number of dublicate apps:', len(dublicate_apps))

Duplicates in the App Store / iOS data set (column 0 holds the app `id` here, so uniqueness is checked by id rather than by name):

In [14]:
catch_dublicates(ios)

Number of dublicate apps: 0


Duplicates in the Google Play / Android data set:

In [15]:
catch_dublicates(android)

Number of dublicate apps: 1181
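
Checking `name in unique_apps` scans a list on every iteration, which becomes slow for large data sets. A sketch of the same count with `collections.Counter`, which also reports which names repeat. The helper name `count_duplicates` and the sample rows are illustrative only:

```python
from collections import Counter

def count_duplicates(data_set, name_index=0):
    """Return {name: occurrences} for names that appear more than once."""
    counts = Counter(row[name_index] for row in data_set)
    return {name: n for name, n in counts.items() if n > 1}

# Made-up rows: [name, number of reviews]
sample = [['Instagram', '100'], ['Slack', '50'], ['Instagram', '120'], ['Instagram', '90']]

repeated = count_duplicates(sample)
print(repeated)

# Extra rows beyond the first occurrence of each name
n_extra = sum(n - 1 for n in repeated.values())
print('Number of duplicate rows:', n_extra)
```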


To find the maximum number of reviews recorded for each unique app name:

In [16]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
        
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews    
len(reviews_max)

9659

To remove the duplicated rows, keep for each app only the entry with *the maximum number of reviews*, on the assumption that this entry corresponds to the most recent information about the app:

In [17]:
android_clean = [] 
already_added = [] 

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        
print(len(android_clean))

9659
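
The two cells above can also be merged into a single pass. The following is a sketch under the same assumption (the row with the most reviews is the most recent); the helper `keep_most_reviewed` is hypothetical and the sample rows are made up:

```python
def keep_most_reviewed(data_set, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in data_set:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows: [name, number of reviews]
sample = [['A', '10'], ['B', '5'], ['A', '30'], ['A', '30'], ['B', '2']]

unique_rows = keep_most_reviewed(sample, name_index=0, reviews_index=1)
print(unique_rows)
```

The result is ordered by the first appearance of each name, which can differ from the row order produced by the two-step approach, although the set of kept rows is the same.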


The next step in data cleaning is to remove *non-English* apps.

The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the **ASCII** (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.
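
As a quick illustration of the rule above, the built-in `ord` returns the number (code point) of a character. The helper name `is_ascii_char` is only for this sketch:

```python
def is_ascii_char(character):
    """Return True when the character's code point is in the ASCII range 0..127."""
    return ord(character) <= 127

# Letters, digits and punctuation are ASCII; the trademark sign, CJK characters and emoji are not
for character in ['a', 'A', '5', '+', '™', '爱', '😜']:
    print(repr(character), ord(character), is_ascii_char(character))
```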

To check whether a string contains more than 3 *non-English* symbols (up to 3 are tolerated, so names with an emoji or a ™ sign still count as English), use the `is_english` function:

In [18]:
def is_english(string):
    n_nonenglish = 0
    for symbol in string:
        if ord(symbol) > 127:
            n_nonenglish += 1  
    
    if n_nonenglish > 3:
        return False
    else:
        return True   
        
print(is_english("Instagram"))
print(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(is_english("Docs To Go™ Free Office Suite"))
print(is_english("Instachat 😜"))

True
False
True
True


English apps in the App Store / iOS data set:

In [19]:
ios_english = []

for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(ios_english, 0, 1, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 6183
Number of columns: 16


(16, 6183)

English apps in the Google Play / Android data set:

In [20]:
android_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)

explore_data(android_english, 0, 1, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


(13, 9614)

The final step in data cleaning is to remove *non-free* apps.

Rows with free apps in App Store / iOS data set:

In [21]:
ios_free = []

for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_free.append(app)
        
explore_data(ios_free, 0, 1, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 3222
Number of columns: 16


(16, 3222)

Rows with free apps in Google Play / Android data set:

In [22]:
android_free = []

for app in android_english:
    price = app[7]
    if price == "0":
        android_free.append(app)
        
explore_data(android_free, 0, 1, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 8864
Number of columns: 13


(13, 8864)
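
The two stores encode a zero price differently (`'0.0'` in the App Store data, `'0'` in the Google Play data), so the string comparisons above depend on the exact format. A hedged sketch of a single numeric check follows. The helper name `is_free` is hypothetical, and the assumption that paid Google Play prices carry a leading `$` (for example `'$4.99'`) should be checked against the raw file:

```python
def is_free(price):
    """Return True when a price string represents zero."""
    # Assumption: a paid price may be prefixed with '$', e.g. '$4.99'
    return float(price.strip().lstrip('$')) == 0.0

for price in ['0', '0.0', '$4.99', '2.99']:
    print(repr(price), '->', is_free(price))
```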

### Data cleaning is completed:

Use the **ios_free** and **android_free** data sets for further analysis.

# The app idea:

The final product of this project is an app that fits both App Store and Google Play platforms. To develop the app we will learn from the example of the most successful apps on both markets.

Aim: determine the most common genres for each market. 


Key parameter: explore `prime_genre` (column 11) in the iOS data set and `Genres` (column 9) together with `Category` (column 1) in the Android data set, and construct a frequency table for each.

To construct the frequency table from the data set use function `freq_table`:

In [23]:
def freq_table(data_set, index):
    frequency_table = {}
    total = 0
    for row in data_set:
        total += 1
        value = row[index]
        if value in frequency_table:
            frequency_table[value] += 1
        else:
            frequency_table[value] = 1
    percent_table = {}
    for key in frequency_table:
        percent_table[key] = (frequency_table[key] / total)*100
    return percent_table
        

To display the percentages in descending order, use the `display_table` function:

In [24]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
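
For comparison, the standard library can build and sort such a table directly. A compact sketch with `collections.Counter`; the function name `percent_table_sorted` and the sample rows are made up for illustration:

```python
from collections import Counter

def percent_table_sorted(data_set, index):
    """Return [(value, percentage), ...] sorted from most to least common."""
    counts = Counter(row[index] for row in data_set)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up rows: [name, genre]
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

table = percent_table_sorted(sample, 1)
for value, percent in table:
    print(value, ':', percent)
```

One difference to keep in mind: `most_common` leaves equally frequent values in first-seen order, whereas sorting `(percentage, key)` tuples in reverse breaks ties by key.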

The frequency table for 'prime_genre' in App Store / iOS data set:

In [25]:
display_table(ios_free, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


About 70\% of the free English apps in the App Store are designed for fun and entertainment: games, entertainment, and photo and video apps. Apps for practical purposes make up a much smaller share.

The frequency table for 'Genres' in Google Play / Android data set:

In [26]:
display_table(android_free, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The frequency table for 'Category' in Google Play / Android data set:

In [27]:
display_table(android_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

By genre, the most common apps on Google Play serve practical purposes (tools, education, business, productivity, finance, lifestyle, medical) and make up about 30\% of the total, while entertainment accounts for about 6\%.

By category, Google Play mainly offers apps for family, games and tools.

The comparison of apps by frequency reveals that the App Store is dominated by fun genres such as games, while Google Play has a broader spectrum of genres that leans more towards practical purposes.

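For each genre we will calculate **the average number of user ratings**, first for the App Store. The App Store data set has no install count, so the total number of ratings (`rating_count_tot`, column 5) serves as a proxy for the number of users.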

In [28]:
genres_ios = freq_table(ios_free, 11)
print(genres_ios)
print('\n')
for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_free:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    av_user_rating = total/len_genre
    print(genre,':',av_user_rating)
    

{'Social Networking': 3.2898820608317814, 'Photo & Video': 4.9658597144630665, 'Games': 58.16263190564867, 'Music': 2.0484171322160147, 'Reference': 0.5586592178770949, 'Health & Fitness': 2.0173805090006205, 'Weather': 0.8690254500310366, 'Utilities': 2.5139664804469275, 'Travel': 1.2414649286157666, 'Shopping': 2.60707635009311, 'News': 1.3345747982619491, 'Navigation': 0.186219739292365, 'Lifestyle': 1.5828677839851024, 'Entertainment': 7.883302296710118, 'Food & Drink': 0.8069522036002483, 'Sports': 2.1415270018621975, 'Book': 0.4345127250155183, 'Finance': 1.1173184357541899, 'Education': 3.662321539416512, 'Productivity': 1.7380509000620732, 'Business': 0.5276225946617008, 'Catalogs': 0.12414649286157665, 'Medical': 0.186219739292365}


Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
T
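
The nested loop above walks the whole data set once per genre. A possible single-pass variant is sketched below; the helper `average_by_group` is hypothetical and the sample rows are made up:

```python
from collections import defaultdict

def average_by_group(data_set, group_index, value_index):
    """Average a numeric column per group in one pass over the data."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in data_set:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows: [genre, number of ratings]
sample = [['Weather', '100'], ['Games', '10'], ['Weather', '300'], ['Games', '30']]

averages = average_by_group(sample, 0, 1)
# Print genres from the highest to the lowest average
for genre, mean in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', mean)
```

With the notebook's data the call would be `average_by_group(ios_free, 11, 5)`; sorting the result makes the highest-rated genres easier to spot than the unsorted printout.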

Based on the analyzed data for the App Store, entertainment apps dominate the offering. We would recommend developing a Weather app: on the one hand many people rate such apps, and on the other hand few of them are offered on the market.

In [29]:
for app in ios_free:
    if app[11] == 'Weather':  # column 11 is prime_genre
        print(app[1], ':', app[5]) # print name and number of ratings

The Weather Channel: Forecast, Radar & Alerts : 495626
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking : 208648
WeatherBug - Local Weather, Radar, Maps, Alerts : 188583
MyRadar NOAA Weather Radar Forecast : 150158
AccuWeather - Weather for Life : 144214
Yahoo Weather : 112603
Weather Underground: Custom Forecast & Local Radar : 49192
NOAA Weather Radar - Weather Forecast & HD Radar : 45696
Weather Live Free - Weather Forecast & Alerts : 35702
Storm Radar : 22792
QuakeFeed Earthquake Map, Alerts, and News : 6081
Moji Weather - Free Weather Forecast : 2333
Hurricane by American Red Cross : 1158
Forecast Bar : 375
Hurricane Tracker WESH 2 Orlando, Central Florida : 203
FEMA : 128
iWeather - World weather forecast : 80
Weather - Radar - Storm with Morecast App : 78
Yurekuru Call : 53
Weather & Radar : 37
WRAL Weather Alert : 25
Météo-France : 24
JaxReady : 22
Freddy the Frogcaster's Weather Station : 14
Almanac Long-Range Weather Forecast : 12
TodayAir

For each category we will estimate **the average number of installs** on Google Play. First, the distribution of the `Installs` column:

In [30]:
display_table(android_free,5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


The install counts are open-ended buckets (for example `10,000+`), so we treat each value as the given number. Note that the `+` and `,` characters must be removed before the string can be converted to a number:

In [31]:
category_android = freq_table(android_free, 1)
print(category_android)
print('\n')
for category in category_android:
    total = 0
    len_category = 0
    for app in android_free:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            total += float(n_installs)
            len_category += 1
    av_n_instals = total/len_category
    print(category,':',av_n_instals)
    

{'ART_AND_DESIGN': 0.6430505415162455, 'AUTO_AND_VEHICLES': 0.9250902527075812, 'BEAUTY': 0.5979241877256317, 'BOOKS_AND_REFERENCE': 2.1435018050541514, 'BUSINESS': 4.591606498194946, 'COMICS': 0.6204873646209386, 'COMMUNICATION': 3.2378158844765346, 'DATING': 1.861462093862816, 'EDUCATION': 1.1620036101083033, 'ENTERTAINMENT': 0.9589350180505415, 'EVENTS': 0.7107400722021661, 'FINANCE': 3.7003610108303246, 'FOOD_AND_DRINK': 1.2409747292418771, 'HEALTH_AND_FITNESS': 3.0798736462093865, 'HOUSE_AND_HOME': 0.8235559566787004, 'LIBRARIES_AND_DEMO': 0.9363718411552346, 'LIFESTYLE': 3.9034296028880866, 'GAME': 9.724729241877256, 'FAMILY': 18.907942238267147, 'MEDICAL': 3.531137184115524, 'SOCIAL': 2.6624548736462095, 'SHOPPING': 2.2450361010830324, 'PHOTOGRAPHY': 2.944494584837545, 'SPORTS': 3.395758122743682, 'TRAVEL_AND_LOCAL': 2.33528880866426, 'TOOLS': 8.461191335740072, 'PERSONALIZATION': 3.3167870036101084, 'PRODUCTIVITY': 3.892148014440433, 'PARENTING': 0.6543321299638989, 'WEATHER': 
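
The string clean-up used in the cell above can be isolated in a small helper, sketched here under the name `parse_installs` (hypothetical). Because the values are open-ended buckets, the parsed number is a lower bound, so averages built from it are rough estimates rather than exact install counts:

```python
def parse_installs(installs):
    """Convert a string such as '1,000,000+' to an integer lower bound."""
    return int(installs.replace(',', '').replace('+', ''))

for value in ['1,000,000+', '500+', '0+', '0']:
    print(repr(value), '->', parse_installs(value))
```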

Based on the analyzed data for Google Play, practical apps dominate the offering. We would recommend developing a Dating app: on the one hand many people install such apps, and on the other hand few of them are offered on the market.

In [32]:
for app in android_free:
    installs = app[5].replace('+','')
    installs = installs.replace(',','')
    if app[1] == 'DATING' and float(installs) >= 5000000:
        print(app[0], ':', app[5]) # print name and number of installs

Match™ Dating - Meet Singles : 10,000,000+
Casual Dating & Adult Singles - Joyride : 5,000,000+
eharmony - Online Dating App : 5,000,000+
Free Dating App & Flirt Chat - Match with Singles : 5,000,000+
Find Real Love — YouLove Premium Dating : 10,000,000+
ChatVideo Meet new people : 5,000,000+
Chat Rooms, Avatars, Date - Galaxy : 10,000,000+
Hitwe - meet people and chat : 10,000,000+
iPair-Meet, Chat, Dating : 5,000,000+
Free Dating & Flirt Chat - Choice of Love : 5,000,000+
Moco - Chat, Meet People : 10,000,000+
Hot or Not - Find someone right now : 10,000,000+
OkCupid Dating : 10,000,000+
Zoosk Dating App: Meet Singles : 10,000,000+
