P02 D Panchal

---
- Python using lists of lists, without the help of NumPy or Pandas
---
# Which type of mobile app is more profitable - development insight

### Based on data from existing apps on the Apple App Store and Google Play

Future apps could be of many different types. These might be similar in terms of development cost. However, the return varies with how much an app is used, because the main source of revenue is in-app ads.

Overall success therefore depends on the decision about which type of app to build. What if this decision were informed by data on existing apps? By analysing data about currently existing apps, we will try to gain the key insights that directly help us choose the categories of apps with better chances of success.

Our **goal** in this project is to come up with a *recommended set of parameters* for future app development, based on information about existing apps.







In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### Reading the two datasets: Apple and Google

In [3]:
from csv import reader

# Read each CSV into a list of rows (the first row is the header);
# the with-blocks close the files once they have been read
with open('AppleStore.csv') as open_file1:
    apps_data1 = list(reader(open_file1))

with open('googleplaystore.csv') as open_file2:
    apps_data2 = list(reader(open_file2))


Below is an example of a row; it shows the `data type` of the information contained in each column.

In [4]:
#explore_data(apps_data1,0,2)
#explore_data(apps_data2,0,2)
print(apps_data2[10473])




['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


    ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
    ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']
    ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
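
The first row printed above has 12 fields, while the Google header has 13 columns, so it may be worth checking row lengths before relying on column positions. A minimal sketch of such a check follows; `find_malformed_rows` and the sample rows are hypothetical and not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row_length) for each row whose length differs from the header's."""
    expected = len(dataset[0])  # the header is assumed to be the first row
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


# Tiny made-up sample: the last row is missing one field.
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.1'],
    ['Broken App', '4.5'],
]
print(find_malformed_rows(sample))
```

Calling `find_malformed_rows(apps_data2)` would list the indexes of any rows that need a closer look.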


### Potential problems with the Google data set

There could be duplicate entries in this data, so we need to check for **duplicates**. Among duplicates, we should keep the entry with the largest number of reviews, as it is likely to be the most recent one.

Below we find that there are 1181 duplicate entries.

In [5]:
duplicate_apps = []
unique_apps = []

for app in apps_data2[1:]:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps', duplicate_apps[:10])


Number of duplicate apps:  1181


Examples of duplicate apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


In [6]:
print('Expected length', len(apps_data2)-1181-1)


Expected length 9660


### Highest review count for each app
Below we build a dictionary of the maximum number of reviews for each app name. This will be used later to filter out duplicates that have fewer reviews than the maximum among apps of the same name.

Some values contain **M** instead of `000,000` (for example `3.0M` in the row printed earlier), which we need to convert before comparing numbers.
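
The conversion described above can be sketched as a small helper. The name `parse_count` is hypothetical, and the sketch assumes that a trailing `M` is the only suffix that occurs.

```python
def parse_count(text):
    """Convert '159' or '3.0M' to a float, reading a trailing 'M' as millions."""
    text = text.strip()
    if text.endswith('M'):
        return float(text[:-1]) * 1_000_000
    return float(text)


print(parse_count('159'))
print(parse_count('3.0M'))
```

Note that `str.replace` returns a new string rather than changing the original, which is why the helper returns the converted value instead of modifying `text` in place.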

In [7]:
reviews_max = {}

for app in apps_data2[1:]:
    name = app[0]
    mill = app[3]
    # str.replace returns a new string, so its result must be used;
    # a trailing 'M' is read as millions
    if 'M' in mill:
        n_reviews = float(mill.replace('M', '')) * 1000000
    else:
        n_reviews = float(mill)

    # keep the largest review count seen for each app name
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))


9660


In the code below, the aim is to run through the full list of apps and compare each app's review count with the maximum review count found in the step above.

In [8]:
android_clean = []
already_added = []

for app in apps_data2[1:]:
    name = app[0]
    mill = app[3]
    # a trailing 'M' is read as millions, as in the previous cell
    if 'M' in mill:
        n_reviews = float(mill.replace('M', '')) * 1000000
    else:
        n_reviews = float(mill)

    # keep the first row that reaches the maximum review count for its name
    if n_reviews >= reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
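
The two passes above could also be folded into one by storing the best row seen so far for each name in a dictionary. The following is only a sketch of that alternative: `keep_most_reviewed` is a hypothetical helper, and it assumes the review column holds plain numbers.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Return one row per app name: the row with the highest review count."""
    best = {}  # name -> (review_count, row)
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # On a tie the earlier row is kept, because the comparison is strict.
        if name not in best or n_reviews > best[name][0]:
            best[name] = (n_reviews, row)
    return [row for _, row in best.values()]


# Made-up rows in the Google column order (name, category, rating, reviews).
sample = [
    ['Box', 'BUSINESS', '4.2', '159'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '160'],
]
for row in keep_most_reviewed(sample):
    print(row)
```

A dictionary lookup also avoids scanning the `already_added` list for every row, which matters more as the data set grows.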
        
    

After removing the duplicates and retaining the entry with the highest review count, let's inspect the cleaned table below.

In [9]:
explore_data(android_clean,0,2)
len(android_clean)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




9660

### Function to check whether an app name is in English
To focus on applications in English, we need to spot *non-English characters* and filter those rows out of our data set.

We are going to use `ord` to help find whether a character is non-English: characters with a code point above 127 fall outside the ASCII range.
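
As a small illustration of how `ord` separates the two kinds of characters, the sketch below counts the characters of a name that fall outside the ASCII range. The helper name `count_non_ascii` is hypothetical.

```python
def count_non_ascii(name):
    """Count characters whose code point lies outside the ASCII range (0-127)."""
    return sum(1 for ch in name if ord(ch) > 127)


print(ord('a'))  # an ASCII letter
print(count_non_ascii('Docs To Go™ Free Office Suite'))
print(count_non_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

An English name with a single symbol such as `™` gives a small count, while a name written mostly in Chinese characters gives a large one.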

In [10]:
def check_english(any_name):
    check_value = 0
    # count the characters outside the ASCII range (code point above 127)
    for letter in any_name:
        letter_value = ord(letter)
        if letter_value > 127:
            check_value += 1
    # the name counts as English when it has at most 3 such characters
    return check_value <= 3

print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
            
    

False
True


### English only apps

Now that we have a function to test whether a name is English (up to three special characters are allowed), let us reduce both data sets to English-only apps.

In [11]:
english_app2 = []

for app in android_clean:
    name = app[0]
    is_english = check_english(name)
    if is_english :
        english_app2.append(app)

print(len(english_app2))


    
english_app1 = []

for app in apps_data1[1:]:
    name = app[1]
    is_english = check_english(name)
    if is_english :
        english_app1.append(app)

print(len(english_app1))



9615
6183


### Filter Free Apps
Below we keep only the free apps from both data sets.

In [12]:



free_app2 = []

for app in english_app2:
    price = app[7]
    if price == '0':
        free_app2.append(app)
        
free_app1 = []

for app in english_app1:
    price = app[4]
    if price == '0.0':
        free_app1.append(app)  

print(len(apps_data1))
print(len(free_app1))
print(len(apps_data2))
print(len(free_app2))

        

7198
3222
10842
8862
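
The two stores write a zero price differently (`'0'` for Google and `'0.0'` for Apple in the rows shown earlier), so the cell above compares against two different strings. A numeric comparison is one way to avoid that. The helper `is_free` below is a hypothetical sketch, and the handling of a `$` prefix is an assumption about how paid prices are written.

```python
def is_free(price_text):
    """Return True when the price text represents zero, e.g. '0', '0.0' or '$0.00'."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (for example a misplaced text value).
        return False


for price in ['0', '0.0', '$4.99', 'Everyone']:
    print(price, is_free(price))
```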


### Frequency
Now we need to create a frequency table that gives the percentage of apps for each value in a column.

In [13]:
def frequency_table(data_set,column_index):
    frequency = {}
    number_ofvalues = 0
    # iterate over data set and add values to dictionary
    for row in data_set:
        value = row[column_index]
        number_ofvalues+=1
        if value in frequency:
            frequency[value]+=1
        else:
            frequency[value]=1
    
    for freq in frequency:
        frequency[freq] = 100*(frequency[freq]/number_ofvalues)
    return frequency
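
Each percentage in the table is `100 * count / total`, so the values of one table should add up to 100. The sketch below shows this on a made-up list of genres; it uses `collections.Counter` from the standard library, and the name `percentage_table` is hypothetical.

```python
from collections import Counter


def percentage_table(values):
    """Map each distinct value to its share of the list, in percent."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: 100 * count / total for value, count in counts.items()}


genres = ['Games', 'Games', 'Music', 'Weather']
table = percentage_table(genres)
print(table)
print(sum(table.values()))
```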

    

A sorted display will be useful when analysing the results.

In [14]:
def display_table(data_set, column_index):
    table = frequency_table(data_set, column_index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
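
`display_table` puts the percentage first in each tuple so that `sorted` orders the entries by percentage. An alternative, sketched below with a hypothetical helper `sorted_by_value`, is to sort the dictionary items directly with a `key` function.

```python
def sorted_by_value(table):
    """Return (key, value) pairs ordered from the largest value to the smallest."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Made-up shares, for illustration only.
shares = {'Music': 25.0, 'Games': 50.0, 'Weather': 12.5, 'News': 12.5}
for name, share in sorted_by_value(shares):
    print(name, ':', share)
```

The two approaches can order tied entries differently: the tuple version falls back to comparing the names, while the `key` version keeps tied entries in their original order.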
    

### Prime Genre of Apple app
From the sorted result, we find that **Games** is the top prime genre among the free English Apple apps, *accounting for more than half of all apps*. It is followed by Entertainment. Social Networking, Shopping, Sports and Music each hold a smaller share than Entertainment.

In [15]:
display_table(free_app1,-5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


### Google App
Categories of Google Play apps: the highest percentage is **Family**, followed by Games. *Games are only about half of Family*. Tools and Games have nearly the same share.

*Unlike Apple*, the percentages on Google fall off gradually, and no category holds close to half of the share.

In [16]:
display_table(free_app2,1)

FAMILY : 18.472128187767996
GAME : 9.851049424509139
TOOLS : 8.451816745655607
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.782893252087565
MAPS_AND_NAVIGATION : 1.399232678853532
EDUCATION : 1.2863913337846988
FOOD_AND_DRINK : 1.2412547957571656
ENTERTAINMENT : 1.128413450688332
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8237418190024826
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
ART_AND_DESIGN : 0.6770480704129994
PARENTING : 0.65

In [17]:
display_table(free_app2,-4)

Tools : 8.440532611148726
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5206499661475967
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7490408485669149
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

In [18]:
print(apps_data1[0:2])
print(apps_data2[0:2])

[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']]
[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']]


### Number of applications in each genre

Let us count the number of apps in each genre; the resulting table also gives us the list of genres to average over.

In [19]:
def abs_frequency_table(data_set,column_index):
    frequency = {}
    # count how many rows carry each value of the chosen column
    for row in data_set:
        value = row[column_index]
        if value in frequency:
            frequency[value]+=1
        else:
            frequency[value]=1

    return frequency

## Most recommendable app profile for Apple
These are the average numbers of user ratings (`rating_count_tot`) per genre for free Apple apps. We can see the **highest** ones are *Navigation*, *Reference* and *Social Networking*.

In [20]:
# the keys of this table are the distinct genres
prime_genre_apple = abs_frequency_table(free_app1,-5)

for genre in prime_genre_apple:
    total = 0
    len_genre = 0
    for row in free_app1:
        genre_app = row[-5]
        if genre_app == genre:
            # column 5 is rating_count_tot, the total number of user ratings
            n_ratings = float(row[5])
            total+=n_ratings
            len_genre+=1
    avg_n_ratings = total/len_genre
    print(genre, ':', avg_n_ratings)
    

Social Networking : 71548.34905660378
Food & Drink : 33333.92307692308
Lifestyle : 16485.764705882353
Productivity : 21028.410714285714
Finance : 31467.944444444445
Games : 22788.6696905016
Navigation : 86090.33333333333
Travel : 28243.8
Utilities : 18684.456790123455
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Reference : 74942.11111111111
Catalogs : 4004.0
Business : 7491.117647058823
Medical : 612.0
Sports : 23008.898550724636
Shopping : 26919.690476190477
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
News : 21248.023255813954
Education : 7003.983050847458
Music : 57326.530303030304
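
The nested loops above scan the whole data set once per genre. The same averages could be gathered in a single pass by keeping a running total and a count for each genre. This is a sketch only: `average_by_group` is a hypothetical helper and the sample rows are made up.

```python
def average_by_group(rows, group_index, value_index):
    """Average the numeric column `value_index` for each distinct value of `group_index`."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


sample = [
    ['App A', 'Social Networking', '100'],
    ['App B', 'Social Networking', '300'],
    ['App C', 'Navigation', '50'],
]
print(average_by_group(sample, 1, 2))
```

With the Apple data the call would be `average_by_group(free_app1, -5, 5)`.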


## ...for Google
From the output below, **Entertainment** would be recommended for Google Play apps, because it is among the categories with the highest average installs, alongside Communication, Video Players and Social.

In [23]:
category_2 = {}
category_2 = abs_frequency_table(free_app2,1)

for category in category_2:
    total = 0
    len_category = 0
    for row in free_app2:
        category_app = row[1]
        if category_app == category:
            installs = row[5]
            installs = installs.replace('+','')
            install = installs.replace(',','')
            num_install = float(install)
            total+=num_install
            len_category+=1
    avg_installs = total/len_category
    print(category, ':', avg_installs)

COMICS : 817657.2727272727
SOCIAL : 23253652.127118643
TOOLS : 10682301.033377837
NEWS_AND_MAGAZINES : 9549178.467741935
PERSONALIZATION : 5201482.6122448975
LIFESTYLE : 1437816.2687861272
SHOPPING : 7036877.311557789
MAPS_AND_NAVIGATION : 4056941.7741935486
PHOTOGRAPHY : 17805627.643678162
WEATHER : 5074486.197183099
COMMUNICATION : 38456119.167247385
HOUSE_AND_HOME : 1331540.5616438356
HEALTH_AND_FITNESS : 4188821.9853479853
MEDICAL : 120616.48717948717
VIDEO_PLAYERS : 24852732.40506329
PARENTING : 542603.6206896552
FOOD_AND_DRINK : 1924897.7363636363
BOOKS_AND_REFERENCE : 8767811.894736841
LIBRARIES_AND_DEMO : 638503.734939759
ENTERTAINMENT : 21134600.0
ART_AND_DESIGN : 1905351.6666666667
DATING : 854028.8303030303
TRAVEL_AND_LOCAL : 13984077.710144928
FINANCE : 1387692.475609756
SPORTS : 3638640.1428571427
GAME : 15896757.674684994
PRODUCTIVITY : 16787331.344927534
EDUCATION : 3082017.543859649
EVENTS : 253542.22222222222
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.886792
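
The install counts are stored as text such as `'10,000+'`, which is why the loop above strips `+` and `,` before converting. The step can be isolated in a small helper; `parse_installs` is a hypothetical name. Because of the `+`, each parsed number is best read as a lower bound rather than an exact count.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000+' to a float (the bucket's lower bound)."""
    return float(text.replace('+', '').replace(',', ''))


print(parse_installs('1,000+'))
print(parse_installs('500,000+'))
```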