# Profitable App for iOS & Android

## What the project is about

The goal of the project is to find the kind of application that can be profitable on both the iOS and Android platforms. The applications considered are free to download and in English.

## What my goal is in this project

My goal is to be able to tell developers what kind of application they should build to attract as many users as possible.

## I. Opening and Exploring the Data:

There are two data sets on [Kaggle](https://www.kaggle.com) that I can use for my analysis:
* [Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps/home) which contains 9960 apps.
* [Mobile App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) which contains 7200 apps.

Let's open them:

In [1]:
from csv import reader

# Google Play data set:
opened_file = open('../input/google-play-store-apps/googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]  # column names
android_body = android[1:]   # data rows

# Apple Store data set:
opened_file = open('../input/app-store-apple-data-set-10k-apps/AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
ios_header = ios[0]  # column names
ios_body = ios[1:]   # data rows
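
The two loading steps above follow the same pattern, so they could be wrapped in one helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that does not appear in the notebook, and the `utf8` default is an assumption about the files.

```python
from csv import reader


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file; the file is closed on exit."""
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(reader(f))
    return rows[0], rows[1:]


# Hypothetical usage with the paths from the cell above:
# android_header, android_body = load_csv('../input/google-play-store-apps/googleplaystore.csv')
# ios_header, ios_body = load_csv('../input/app-store-apple-data-set-10k-apps/AppleStore.csv')
```

The `with` statement closes the file even if reading fails part-way through.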

To explore these data sets easily, I will create a function called `f_explore_data`:

In [2]:
def f_explore_data(dataset, start, end, no_rows_columns = False):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n')
    if no_rows_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

In [3]:
print(android_header)
print('\n')
f_explore_data(android_body, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13


The number of rows in the Google Play data set (10841) doesn't match the number of apps (9960) I mentioned above, which suggests that the data contains a number of duplicate entries.

In [4]:
print(ios_header)
print('\n')
f_explore_data(ios_body, 0, 3, True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows:  7197
Number of columns:  17


Here the number of rows (7197) is close to, and slightly below, the roughly 7200 apps mentioned above. I will still check this data set for duplicates during cleaning.

Some column names are not self-explanatory, for example 'track_name' and 'rating_count_tot'. Their descriptions can be found in the documentation of each data set:

* [Google Play documentation](https://www.kaggle.com/lava18/google-play-store-apps)

* [Apple Store documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

## II. Cleaning the Data:
### 1. Deleting wrong data:

Before removing duplicate rows, I need to dig deeper to see whether the data sets have any other problems. Fortunately, both data sets have discussion sections that can save me time in detecting errors.
* [Google Play Discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion)
* [Apple Store Discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion)

After reading through all the discussions, I found only [one](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), which reports that row 10472 is wrong.

In [5]:
print(android_header)
print('\n')
print(android_body[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


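This row is missing its Category value, so every later column is shifted one position to the left. As a result, the Rating column holds 19, which is impossible because the maximum rating on Google Play is 5. The row will therefore be deleted.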

In [6]:
del android_body[10472]
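
A row with a missing value like this one can also be found without reading the discussion forum, by comparing each row's length with the header's length. The sketch below uses toy data; `find_malformed_rows` is a hypothetical helper, not part of the notebook.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data (hypothetical): the second row is missing its Category value.
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5'],
    ['Photo Frame', '1.9'],
]
bad = find_malformed_rows(toy_header, toy_rows)
print(bad)  # [(1, ['Photo Frame', '1.9'])]
```

This check only catches rows with a wrong number of columns. A row with the right length but wrong values would still need a separate check.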

### 2. Removing duplicate rows:
We already know that the Google Play data set contains many duplicates. For instance, Instagram has four entries:

In [7]:
print(android_header)
print('\n')
for row in android_body:
    v_app_name = row[0]
    if v_app_name == 'Instagram':
        print(row)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


I will build a function `f_find_duplication` to count the duplicates in each data set:

In [8]:
def f_find_duplication(dataset, index):
    l_duplicate_app = []
    l_unique_app = []
    
    for row in dataset:
        v_app_name = row[index]
        if v_app_name in l_unique_app:
            l_duplicate_app.append(v_app_name)
        else:
            l_unique_app.append(v_app_name)
    
    print('Number of duplicate apps: ', len(l_duplicate_app))
    print('Number of unique apps: ', len(l_unique_app))

In [9]:
f_find_duplication(android_body, 0)

Number of duplicate apps:  1181
Number of unique apps:  9659


The Google Play data set has 9659 unique apps. The counts are consistent: 1181 + 9659 = 10840, one fewer than the original 10841 rows because I deleted the erroneous row.

In [10]:
f_find_duplication(ios_body, 1)

Number of duplicate apps:  0
Number of unique apps:  7197


The Apple Store data set has no duplicate apps.
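
`f_find_duplication` tests membership in a list, which gets slower as the list grows. As an alternative sketch, the same two numbers can be obtained with `collections.Counter` from the standard library. `count_duplicates` is a hypothetical name, and the rows below are toy data.

```python
from collections import Counter


def count_duplicates(rows, index):
    """Return (number of duplicate rows, number of unique values) for a column."""
    counts = Counter(row[index] for row in rows)
    n_unique = len(counts)
    # Every occurrence after the first one counts as a duplicate.
    n_duplicates = sum(counts.values()) - n_unique
    return n_duplicates, n_unique


toy_rows = [['Instagram'], ['Instagram'], ['Pou'], ['Instagram']]
print(count_duplicates(toy_rows, 0))  # (2, 2)
```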

My task is clear: remove the 1181 duplicate entries from the Google Play data set. Removing them at random is not a good choice, though. Looking closely at Instagram's four entries, we can see that they differ in the number of reviews (three distinct values). The same could happen with other apps. I want to keep the entry with the highest number of reviews, because it is most likely the latest entry for that app.

In [11]:
d_reviews_max = {}

for row in android_body:
    v_app_name = row[0]
    v_reviews_max = int(row[3])
    if (v_app_name in d_reviews_max) and (v_reviews_max > d_reviews_max[v_app_name]):
        d_reviews_max[v_app_name] = v_reviews_max
    elif v_app_name not in d_reviews_max:
        d_reviews_max[v_app_name] = v_reviews_max

print('Number of unique apps in d_reviews_max dictionary: ', len(d_reviews_max))

Number of unique apps in d_reviews_max dictionary:  9659


Let's double check Instagram's reviews in `d_reviews_max`:

In [12]:
print('Number of Instagram\'s reviews: ',d_reviews_max['Instagram'])

Number of Instagram's reviews:  66577446


So far, so good. Now I will use the dictionary to remove duplications.

In [13]:
l_android_clean = []
already_added = []

for row in android_body:
    v_app_name = row[0]
    v_reviews_max = int(row[3])
    if (v_app_name not in already_added) and (v_reviews_max == d_reviews_max[v_app_name]):
        l_android_clean.append(row)
        already_added.append(v_app_name)
        
print('Number of unique apps in l_android_clean: ',len(l_android_clean))

Number of unique apps in l_android_clean:  9659


Double check with Instagram's reviews in `l_android_clean`:

In [14]:
print(android_header)
print('\n')
for row in l_android_clean:
    v_app_name = row[0]
    if v_app_name == 'Instagram':
        print(row)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Everything is as I expected. All unwanted rows were removed. The next step is removing non-English applications.
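
As an aside, the two passes above (build `d_reviews_max`, then filter) can be folded into a single pass that stores the best row per app directly. This is a sketch under the same assumptions as the cells above (app name at index 0, review count at index 3). `keep_most_reviewed` is a hypothetical name, and the second app in the toy data is made up.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = int(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Example App', 'TOOLS', '4.0', '100'],  # hypothetical entry
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
]
clean = keep_most_reviewed(toy_rows)
print(len(clean))  # 2
```

One difference from the two-pass version: the result is ordered by the first appearance of each app name, not by the position of the row that was kept.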

### 3. Removing non-English apps:

I can use the built-in function [ord()](https://docs.python.org/3/library/functions.html#ord), which returns the integer code point of a character. Characters commonly used in English text belong to the [ASCII](https://ascii.cl/) set (short for [American Standard Code for Information Interchange](https://en.wikipedia.org/wiki/ASCII)), whose 128 code points range from 0 to 127. To filter out non-English apps, we can create a function `f_is_english` that returns `False` if any character of the input string has an `ord()` value greater than 127, and `True` otherwise.
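
To make the idea concrete, here is a small sketch that prints the code point of a few characters and whether each falls in the ASCII range. The sample characters are chosen only for illustration.

```python
samples = ['a', 'Z', '~', '™', '爱', '😜']

# Map each character to its integer code point.
code_points = {ch: ord(ch) for ch in samples}

for ch, value in code_points.items():
    print(repr(ch), value, 'ASCII' if value <= 127 else 'non-ASCII')
# 'a' is 97 and '™' is 8482; the last three characters are all above 127.
```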

In [15]:
def f_is_english(v_string):
    for v_character in v_string:
        if ord(v_character) > 127:
            return False
    return True

In [16]:
f_is_english('Instagram')

True

In [17]:
f_is_english('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [18]:
f_is_english('Docs To Go™ Free Office Suite')

False

In [19]:
f_is_english('Instachat 😜')

False

I ran some tests and could see that my function is not good enough. It fails to recognize 'Docs To Go™ Free Office Suite' and 'Instachat 😜' as English apps because of the `™` trademark symbol and the emoji. To help the function recognize such apps as English, I will add one more condition to `f_is_english`: a name is rejected only if it contains more than three non-ASCII characters.

In [20]:
def f_is_english(v_string):
    v_count = 0
    
    for character in v_string:
        if ord(character) > 127:
            v_count += 1
            if v_count > 3:
                return False
    return True

In [21]:
f_is_english('Docs To Go™ Free Office Suite')

True

In [22]:
f_is_english('Instachat 😜')

True

In [23]:
f_is_english('爱奇艺PPS -《欢乐颂2》电视剧热播')

False
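
The threshold of three is a heuristic, so it helps to know where it can go wrong. The sketch below restates the same rule with two hypothetical helpers (`count_non_ascii` and `is_probably_english`, neither of which is in the notebook) and shows one false positive and one false negative on made-up names.

```python
def count_non_ascii(text):
    """Number of characters whose code point is above 127."""
    return sum(1 for ch in text if ord(ch) > 127)


def is_probably_english(text, max_non_ascii=3):
    """Same rule as f_is_english: tolerate up to max_non_ascii such characters."""
    return count_non_ascii(text) <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))  # True
print(is_probably_english('中文'))  # True: only 2 non-ASCII characters (false positive)
print(is_probably_english('Fun Quiz 😜😜😜😜'))  # False: 4 emoji (false negative)
```

A few apps will therefore be misclassified either way, which seems acceptable for this analysis.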

Now let's use this function to filter both data sets:

In [24]:
l_android_english = []
l_ios_english = []

for row in l_android_clean:
    v_app_name = row[0]
    if f_is_english(v_app_name):
        l_android_english.append(row)
        
for row in ios_body:
    v_app_name = row[2]
    if f_is_english(v_app_name):
        l_ios_english.append(row)
        
print('Number of english Android apps: ', len(l_android_english))
print('Number of english iOS apps: ', len(l_ios_english))

Number of english Android apps:  9614
Number of english iOS apps:  6183


So far, I have:
* Deleted the wrong entry
* Removed duplicate entries
* Filtered out non-English apps

The final step of the cleaning process is isolating the free apps in both data sets.

### 4. Isolating the Free Apps:

In [25]:
l_android_final = []
l_ios_final = []

for row in l_android_english:
    price = row[7]
    price = price.replace('$','')
    price = price.replace('Everyone', '0')
    price = float(price)
    if price == 0.0:
        l_android_final.append(row)
        
for row in l_ios_english:
    price = float(row[5])
    if price == 0.0:
        l_ios_final.append(row)
        
print('Number of free Android apps: ',len(l_android_final))
print('Number of free iOS apps: ',len(l_ios_final))

Number of free Android apps:  8864
Number of free iOS apps:  3222


I've finished cleaning the data sets. There are 8864 Android apps and 3222 iOS apps left. Let's begin the analysis to find out what kind of app is best for making money.
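
The price handling in the cell above can be isolated in a small helper. This is a sketch with hypothetical names (`parse_price`, `is_free`). It leaves out the `'Everyone'` replacement, which appears to be needed only for the shifted row that was deleted earlier (that row has `'Everyone'` at index 7).

```python
def parse_price(text):
    """Convert a price string such as '0', '4.99' or '$4.99' to a float."""
    return float(text.replace('$', '').strip())


def is_free(text):
    """True when the parsed price is exactly zero."""
    return parse_price(text) == 0.0


print(parse_price('$4.99'))  # 4.99
print(is_free('0'))          # True
print(is_free('$0.99'))      # False
```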

## III. Analysing the Data:
### 1. Most Common Apps by Genre:

I will need to build two functions to answer the question "What are the most common Apps by Genre?":
* First, a function `f_freq_table` to show frequency in percentage of each Genre.
* Second, a function `f_freq_table_desc` to show percentages in a descending order.

In [26]:
def f_freq_table(dataset, index):
    d_freq_table = {}
    
    for row in dataset:
        v_genre = row[index]
        if v_genre in d_freq_table:
            d_freq_table[v_genre] += 1
        else:
            d_freq_table[v_genre] = 1
    
    d_freq_table_percent = {}
    for key in d_freq_table:
        v_genre_percent = (d_freq_table[key] / len(dataset)) * 100
        d_freq_table_percent[key] = v_genre_percent
        
    return d_freq_table_percent

def f_freq_table_desc(dataset, index):
    d_table = f_freq_table(dataset, index)
    l_table_display = []
    
    for key in d_table:
        v_tuple = (float(d_table[key]), key)
        l_table_display.append(v_tuple)
        
    l_table_display = sorted(l_table_display, reverse = True)
    for row in l_table_display:
        print(row[1], ':', row[0])
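
For comparison, the same percentage table can be sketched with `collections.Counter`, whose `most_common()` method already returns the values sorted by count. `freq_table_percent` is a hypothetical name, and the rows are toy data.

```python
from collections import Counter


def freq_table_percent(rows, index):
    """Return [(value, percentage)] sorted by percentage, highest first."""
    counts = Counter(row[index] for row in rows)
    total = len(rows)
    return [(value, count / total * 100) for value, count in counts.most_common()]


toy_rows = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
for genre, percent in freq_table_percent(toy_rows, 1):
    print(f'{genre} : {percent:.2f}')
# GAME : 75.00
# TOOLS : 25.00
```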

In [27]:
f_freq_table_desc(l_android_final, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [28]:
f_freq_table_desc(l_ios_final, 12)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Top 5 Most Common Apps by Genre:

|Top|Android  |iOS              |
|---|---------|-----------------|
|1  |Family   |Games            |
|2  |Game     |Entertainment    |
|3  |Tools    |Photo & Video    |
|4  |Business |Education        |
|5  |Lifestyle|Social Networking|

Games are among the most common genres on both platforms: first on iOS and second on Android, behind Family. Building a game therefore looks like a reasonable choice, and a simple game does not need many resources. Flappy Bird is an example: its graphics and content are very simple, yet it is very addictive.

However, one question remains: how popular are these genres? This leads me to the next analysis, **Most Popular Apps by Genre**.

### 2. Most Popular Apps by Genre:
#### Part 1: Most Popular Apps by Genre on Google Play:

I can use the Installs column to estimate the popularity of each genre.

In [29]:
d_popular_genre = {}

for row in l_android_final:
    v_genre = row[1]
    v_installs = row[5]
    v_installs = v_installs.replace('+','')
    v_installs = v_installs.replace(',','')
    v_installs = int(v_installs)
    if v_genre in d_popular_genre:
        d_popular_genre[v_genre] += v_installs
    else:
        d_popular_genre[v_genre] = v_installs
        
l_table_display = []

for key in d_popular_genre:
    v_tuple = (d_popular_genre[key], key)
    l_table_display.append(v_tuple)
    
l_table_display = sorted(l_table_display, reverse = True)
for row in l_table_display:
    print(row[1], ':', row[0])

GAME : 13436869450
COMMUNICATION : 11036906201
TOOLS : 8101043474
FAMILY : 6193895690
PRODUCTIVITY : 5791629314
SOCIAL : 5487861902
PHOTOGRAPHY : 4656268815
VIDEO_PLAYERS : 3931731720
TRAVEL_AND_LOCAL : 2894704086
NEWS_AND_MAGAZINES : 2368196260
BOOKS_AND_REFERENCE : 1665884260
PERSONALIZATION : 1529235888
SHOPPING : 1400338585
HEALTH_AND_FITNESS : 1143548402
SPORTS : 1095230683
ENTERTAINMENT : 989460000
BUSINESS : 696902090
MAPS_AND_NAVIGATION : 503060780
LIFESTYLE : 497484429
FINANCE : 455163132
WEATHER : 360288520
FOOD_AND_DRINK : 211738751
EDUCATION : 188850000
DATING : 140914757
ART_AND_DESIGN : 113221100
HOUSE_AND_HOME : 97202461
AUTO_AND_VEHICLES : 53080061
LIBRARIES_AND_DEMO : 52995810
COMICS : 44971150
MEDICAL : 37732344
PARENTING : 31471010
BEAUTY : 27197050
EVENTS : 15973160
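
The install-count parsing and the per-genre sum can each be given a name, which makes the steps easier to reuse. The sketch below uses toy rows; `parse_installs` and `total_by_key` are hypothetical helpers.

```python
def parse_installs(text):
    """Convert an Installs string such as '1,000,000+' to an int."""
    return int(text.replace('+', '').replace(',', ''))


def total_by_key(rows, key_index, value_index, parse=int):
    """Sum the parsed values per key; return (key, total) pairs, largest first."""
    totals = {}
    for row in rows:
        key = row[key_index]
        totals[key] = totals.get(key, 0) + parse(row[value_index])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


toy_rows = [
    ['A', 'GAME', '1,000+'],
    ['B', 'TOOLS', '500+'],
    ['C', 'GAME', '10,000+'],
]
print(total_by_key(toy_rows, 1, 2, parse=parse_installs))
# [('GAME', 11000), ('TOOLS', 500)]
```

Note that values such as '10,000+' are lower bounds, so the totals are estimates, not exact install counts.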


#### Part 2: Most Popular Apps by Genre on Apple Store:

The Apple Store data set doesn't tell us the number of installs. I will use `rating_count_tot` (the total number of user ratings across all versions) instead:

In [30]:
d_popular_genre = {}

for row in l_ios_final:
    v_genre = row[12]
    v_ratings = int(row[6])
    if v_genre in d_popular_genre:
        d_popular_genre[v_genre] += v_ratings
    else:
        d_popular_genre[v_genre] = v_ratings
        
l_table_display = []

for key in d_popular_genre:
    v_tuple = (d_popular_genre[key], key)
    l_table_display.append(v_tuple)
    
l_table_display = sorted(l_table_display, reverse = True)
for row in l_table_display:
    print(row[1], ':', row[0])

Games : 42705967
Social Networking : 7584125
Photo & Video : 4550647
Music : 3783551
Entertainment : 3563577
Shopping : 2261254
Sports : 1587614
Health & Fitness : 1514371
Utilities : 1513441
Weather : 1463837
Reference : 1348958
Productivity : 1177591
Finance : 1132846
Travel : 1129752
News : 913665
Food & Drink : 866682
Lifestyle : 840774
Education : 826470
Book : 556619
Navigation : 516542
Business : 127349
Catalogs : 16016
Medical : 3672
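
Totals grow with the number of apps in a genre, and Games make up about 58% of the free iOS apps in this data, so a per-app average can be a useful complementary view. The sketch below is not part of the original analysis; `average_by_key` is a hypothetical helper and the numbers are toy values.

```python
def average_by_key(rows, key_index, value_index, parse=int):
    """Average the parsed values per key; return (key, average), largest first."""
    totals, counts = {}, {}
    for row in rows:
        key = row[key_index]
        totals[key] = totals.get(key, 0) + parse(row[value_index])
        counts[key] = counts.get(key, 0) + 1
    averages = {key: totals[key] / counts[key] for key in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Toy rows: (name, genre, rating count). Games has the larger total (600 vs 500)
# but the smaller average per app (200 vs 500).
toy_rows = [
    ['g1', 'Games', '100'],
    ['g2', 'Games', '300'],
    ['g3', 'Games', '200'],
    ['m1', 'Music', '500'],
]
print(average_by_key(toy_rows, 1, 2))
# [('Music', 500.0), ('Games', 200.0)]
```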


Top 5 Most Popular Apps by Genre:

|Top|Android      |iOS              |
|---|-------------|-----------------|
|1  |Game         |Games            |
|2  |Communication|Social Networking|
|3  |Tools        |Photo & Video    |
|4  |Family       |Music            |
|5  |Productivity |Entertainment    |

Top 5 Most Common Apps by Genre:

|Top|Android  |iOS              |
|---|---------|-----------------|
|1  |Family   |Games            |
|2  |Game     |Entertainment    |
|3  |Tools    |Photo & Video    |
|4  |Business |Education        |
|5  |Lifestyle|Social Networking|

The two tables above reinforce my idea of building a game. The last thing I would like to know is: what kind of game attracts the most users?

### 3. Most Popular Game Apps on Google Play and Apple Store:

In [31]:
##Google Play
d_popular_game = {}

for row in l_android_final:
    v_app_name = row[0]
    v_genre = row[1]
    v_installs = row[5]
    v_installs = v_installs.replace('+','')
    v_installs = v_installs.replace(',', '')
    v_installs = int(v_installs)
    if v_genre == 'GAME':
        d_popular_game[v_app_name] = v_installs
        
l_popular_game = []

for key in d_popular_game:
    v_tuple = (d_popular_game[key], key)
    l_popular_game.append(v_tuple)

l_popular_game = sorted(l_popular_game, reverse = True)

for row in l_popular_game[:6]:
    print(row[1], ':',row[0])

Subway Surfers : 1000000000
Temple Run 2 : 500000000
Pou : 500000000
My Talking Tom : 500000000
Candy Crush Saga : 500000000
slither.io : 100000000


In [32]:
##Apple Store
d_popular_game = {}

for row in l_ios_final:
    v_app_name = row[2]
    v_genre = row[12]
    v_ratings = int(row[6])
    if v_genre == 'Games':
        d_popular_game[v_app_name] = v_ratings
        
l_popular_game = []

for key in d_popular_game:
    v_tuple = (d_popular_game[key], key)
    l_popular_game.append(v_tuple)

l_popular_game = sorted(l_popular_game, reverse = True)

for row in l_popular_game[:6]:
    print(row[1], ':',row[0])

Clash of Clans : 2130805
Temple Run : 1724546
Candy Crush Saga : 961794
Angry Birds : 824451
Subway Surfers : 706110
Solitaire : 679055
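
The "build a list of tuples, sort, slice" pattern used in the two cells above can also be written with `heapq.nlargest` from the standard library. The sketch below uses four of the iOS rating counts printed above; `top_n` is a hypothetical name.

```python
import heapq


def top_n(scores, n=5):
    """Return the n (name, score) pairs with the highest scores."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


ios_ratings = {
    'Clash of Clans': 2130805,
    'Temple Run': 1724546,
    'Candy Crush Saga': 961794,
    'Angry Birds': 824451,
}
print(top_n(ios_ratings, n=2))
# [('Clash of Clans', 2130805), ('Temple Run', 1724546)]
```

With equal scores, this version keeps the dictionary's insertion order, whereas sorting `(score, name)` tuples in reverse breaks ties by name.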


Top 5 Most Popular Games on Google Play:

|Top|Android         |Game Genres      |
|---|----------------|-----------------|
|1  |Subway Surfers  |Arcade           |
|2  |Temple Run 2    |Arcade           |
|3  |Pou             |Casual           |
|4  |My Talking Tom  |Casual           |
|5  |Candy Crush Saga|Casual           |

Top 5 Most Popular Games on Apple Store:

|Top|iOS             |Game Genres      |
|---|----------------|-----------------|
|1  |Clash of Clans  |Strategy         |
|2  |Temple Run      |Arcade           |
|3  |Candy Crush Saga|Casual           |
|4  |Angry Birds     |Casual           |
|5  |Subway Surfers  |Arcade           |


## IV. Conclusion:

Arcade and Casual games dominate the top five on both Android and iOS. However, Pou and My Talking Tom are most likely played by children, who might not pay much attention to ads, and Candy Crush Saga dominates its category. An arcade game would therefore be the best choice.
Subway Surfers and Temple Run are endless runner games. They are easy to play, have nice graphics, and require users to pay attention during gameplay, which is good because users are then likely to see our ads.

Finally, the kind of app developers should build:
* Arcade game
* Nice graphics
* Easy and fun to play, like Subway Surfers, Temple Run and Candy Crush Saga
* Increasingly challenging the longer users play, like Flappy Bird

*The main purpose of this project is to practice what I have learned from the [dataquest.io](https://www.dataquest.io) Python for Data Science: Fundamentals course. Many of the techniques and much of the content in this project were guided by dataquest.io and the following [solution](https://github.com/dataquestio/solutions/blob/master/Mission350Solutions.ipynb).*