# Most Viewed Free Apps Analysis
## Project Overview

 This project is written for companies interested in seeing which free mobile apps are used the most.
 
 It takes data from Google Play and the App Store and produces frequency tables that relate free app categories to their number of users.

**Reasons to explore this relationship**
 - The bulk of free apps make their revenue from in-app ads
 - It is beneficial to see which types of free apps are used the most
 
**Data sets used**
 - 7,000 iOS apps from the [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) (Jul 2017)
 - 10,000 Android apps from [Google Play](https://www.kaggle.com/lava18/google-play-store-apps/home) (Aug 2018)

In [1]:
#Open and Save Data
from csv import reader

opApple=open('AppleStore.csv')
rdApple=reader(opApple)
apple=list(rdApple)
hdApple=apple[0]
apple=apple[1::]

opGoogle=open('googleplaystore.csv')
rdGoogle=reader(opGoogle)
google=list(rdGoogle)
hdGoogle=google[0]
google=google[1::]

In [2]:
#Explore Data function
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
#Print Data Slice
explore_data(apple,0,2,True)
#Apple original 7197 rows and 16 col w/o header
print('\n')
explore_data(google,10472,10473,True)
#Google original 10841 rows and 13 col w/o header

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7197
Number of columns: 16


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
# View the Header for the Data Sets
print(hdApple)
print('\n')
print(hdGoogle)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


*From the Apple Data Set Header*
The important columns are:
 - `price` (to check if it's free)
 - `rating_count_tot` (total number of raters)

In [5]:
#Clean Data
    #correct or remove inaccurate data
    #delete duplicate


In [6]:
#Google Play row error: row 10472 has 12 values instead of the 13 in the header
#Run this cell ONLY ONCE; re-running it deletes a different, valid row
del google[10472]
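
Because `del google[10472]` removes whatever row currently sits at that index, re-running the cell silently drops a valid row. A minimal sketch of a re-runnable alternative follows, assuming the malformed row is the only one whose column count differs from the header. The helper name `drop_malformed_rows` and the toy rows are hypothetical and not part of the original notebook.

```python
def drop_malformed_rows(rows, header):
    """Return only the rows that have as many columns as the header."""
    expected = len(header)
    return [row for row in rows if len(row) == expected]

# Toy data standing in for hdGoogle / google (values are illustrative only)
header = ['App', 'Category', 'Rating']
rows = [['A', 'TOOLS', '4.1'], ['B', '1.9'], ['C', 'GAME', '3.8']]

cleaned = drop_malformed_rows(rows, header)
cleaned_again = drop_malformed_rows(cleaned, header)  # re-running changes nothing
print(len(cleaned), len(cleaned_again))
```

In the notebook this would be called as `google = drop_malformed_rows(google, hdGoogle)`, which can be executed any number of times without deleting additional rows.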

# Duplicate Data
 - There is duplicate data that must be identified and deleted in a logical manner
 - When an app has duplicate entries, the one kept will be the entry with the highest number of reviews, since entries collected earlier will have fewer reviews
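
The rule above can be sketched on a toy list before applying it to the full data sets. This is only an illustration, assuming each row is a list with the app name and review count at known indexes. The function name `dedupe_by_max_reviews` is hypothetical, and the review count for 'Box' is invented for the example.

```python
def dedupe_by_max_reviews(rows, name_idx, reviews_idx):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Toy rows: [name, review count]
toy = [
    ['Instagram', '66577313'],
    ['Box', '1000'],            # invented value, for illustration only
    ['Instagram', '66577446'],
    ['Instagram', '66509917'],
]
result = dedupe_by_max_reviews(toy, 0, 1)
print(result)
```

Using a dictionary keyed by name avoids the repeated `name in list` lookups, which matters little at this size but keeps the pass over the data linear.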

In [7]:
#Duplicate Data in Apple? Quick Check via Name
for rowA in apple:
    nameA=rowA[1]
    if nameA == 'Instagram':
        print(rowA)

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


In [8]:
#Duplicate Data in Google? Quick Check via Name
for row in google:
    name=row[0]
    if name == 'Instagram':
        print(row)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [9]:
#Count Duplicates in Apple
dupl_appsA=[]
uniq_appsA=[]

for rowA in apple:
    nameA=rowA[1]
    if nameA in uniq_appsA:
        dupl_appsA.append(nameA)
    else:
        uniq_appsA.append(nameA)
print('Number of duplicate apps in Apple:',len(dupl_appsA))
print('Examples of duplicate apps in Apple:', dupl_appsA[:5])
print('Expected length of Apple w/ no duplicates:',len(apple) - len(dupl_appsA))

Number of duplicate apps in Apple: 2
Examples of duplicate apps in Apple: ['Mannequin Challenge', 'VR Roller Coaster']
Expected length of Apple w/ no duplicates: 7195


In [10]:
#Count Duplicates in Google
dupl_apps=[]
uniq_apps=[]

for row in google:
    name=row[0]
    if name in uniq_apps:
        dupl_apps.append(name)
    else:
        uniq_apps.append(name)
print('Number of duplicate apps in Google:',len(dupl_apps))
print('Examples of duplicate apps in Google:', dupl_apps[:5])
print('Expected length of Google w/ no duplicates:',len(google) - len(dupl_apps))

Number of duplicate apps in Google: 1181
Examples of duplicate apps in Google: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
Expected length of Google w/ no duplicates: 9659


In [11]:
#Find the Highest Review Count per App in Apple (used later to drop duplicates)
reviews_maxA={}
for rowA in apple:
    nameA=rowA[1]
    n_reviewsA=float(rowA[5])
    if nameA in reviews_maxA and n_reviewsA > reviews_maxA[nameA]:
        reviews_maxA[nameA]=n_reviewsA
    if nameA not in reviews_maxA:
        reviews_maxA[nameA]=n_reviewsA
print('Expected length: ', len(reviews_maxA))

Expected length:  7195


In [12]:
#Find the Highest Review Count per App in Google (used later to drop duplicates)
reviews_max={}
for row in google:
    name=row[0]
    n_reviews=float(row[3])
    if name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name]=n_reviews
    if name not in reviews_max:
        reviews_max[name]=n_reviews
print('Expected length: ', len(reviews_max))
    

Expected length:  9659


In [13]:
#New List w/o Duplicates in Apple
iOS_clean=[]
already_addedA=[]
for rowA in apple:
    nameA=rowA[1]
    n_reviewsA=float(rowA[5])
    if n_reviewsA==reviews_maxA[nameA] and nameA not in already_addedA:
        iOS_clean.append(rowA)
        already_addedA.append(nameA)
        
print(len(iOS_clean))
print(len(iOS_clean[0]))

7195
16


In [14]:
#New List w/o Duplicates in Google
android_clean=[]
already_added=[]
for row in google:
    name=row[0]
    n_reviews=float(row[3])
    if n_reviews==reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
print(len(android_clean))
print(len(android_clean[0]))


9659
13


In [15]:
#Function to Detect English App Names: allows up to 3 non-ASCII characters (e.g. ™ or emoji)
def is_english(string):
    more3=0
    for character in string:
        if ord(character) > 127:
            more3+=1
    if more3 > 3:
        return False 
    return True
    
IG=is_english('Instagram')
CH=is_english('爱奇艺PPS -《欢乐颂2》电视剧热播')
TM=is_english('Docs To Go™ Free Office Suite')
EM=is_english('Instachat 😜')
print(IG,CH,TM,EM)

True False True True


In [16]:
#Filter non-English Apps from Apple
eng_iOS=[]
for rowA in iOS_clean:
    nameA=rowA[1]
    is_EngA=is_english(nameA)
    if is_EngA == True:
        eng_iOS.append(rowA)
        
print(len(eng_iOS))
print(len(eng_iOS[0]))  

6181
16


In [17]:
#Filter non-English Apps from Google
eng_android=[]
for row in android_clean:
    name=row[0]
    is_Eng=is_english(name)
    if is_Eng == True:
        eng_android.append(row)
        
print(len(eng_android))
print(len(eng_android[0]))  

9614
13


In [18]:
#Isolate Free Apps in Apple
free_iOS=[]
for rowA in eng_iOS:
    price=float(rowA[4])
    if price == 0:
        free_iOS.append(rowA)
                       
print(len(free_iOS))
print(len(free_iOS[0]))                         

3220
16


In [19]:
#Isolate Free Apps in Google
free_android=[]
for row in eng_android:
    price=row[7] #note don't convert to float because values>0 have $
    if price == '0': #therefore compare it with zero string; could do replace $ to blank
        free_android.append(row)
                       
print(len(free_android))
print(len(free_android[0])) 

print(0.0==0) #to check if float == integer

8864
13
True
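
The comment in the cell above mentions that the `$` could be replaced with a blank instead of comparing against the string `'0'`. A minimal sketch of that alternative follows, assuming paid prices look like `'$4.99'`. The helper name `parse_price` and the sample price are hypothetical.

```python
def parse_price(price_text):
    """Convert a Google Play price string such as '$4.99' or '0' to a float."""
    return float(price_text.replace('$', ''))

free = parse_price('0')
paid = parse_price('$4.99')  # sample value, assumed format
print(free == 0, paid > 0)
```

With the price as a number, the Google Play filter can use the same `price == 0` test as the App Store cell.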


## Why Find an App Profile that Fits both the App Store and Google Play?
The validation approach to minimize risk and overhead when building an app is:
 1. Build a minimal Android app for Google Play
 2. Develop the app further if the response from users is positive
 3. Develop an iOS version for the App Store if the app is profitable after six months
 
To achieve these goals, we must first find the category that is popular in both the iOS and Android markets.

For the App Store we'll use the `prime_genre` column, and for Google Play we'll use the `Category` column. For each we'll create a frequency table that shows percentages, and another function will display the percentages in descending order.

## Sorting the Frequency Table
 - Calling `sorted()` directly on a dictionary returns only its sorted keys, so on its own it does not order the table by percentage
 - Thus we will transform the dictionary into a list of tuples, which `sorted()` handles well
 - Each tuple (similar to a list, but immutable) holds a dictionary value followed by its key. The value comes first because tuples are compared element by element, so the list ends up sorted by percentage
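
The points above can be seen on a small example. The dictionary below holds made-up percentages purely for illustration; the variable names are not from the notebook.

```python
# Toy frequency table (illustrative values only)
freq = {'Games': 50.0, 'Education': 20.0, 'Weather': 30.0}

# sorted() on a dict looks only at the keys
keys_only = sorted(freq)

# (value, key) tuples sort by value first, then by key on ties
pairs = [(value, key) for key, value in freq.items()]
by_value = sorted(pairs, reverse=True)

print(keys_only)
print(by_value)
```

An equivalent option is `sorted(freq.items(), key=lambda item: item[1], reverse=True)`, which keeps the key first in each tuple and sorts on the value explicitly.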


In [20]:
#Function to Create Frequency Table
def freq_table(dataset,index):
    #dataset - list of lists; index - integer
    freq={}
    for row in dataset:
        genre=row[index]
        if genre in freq:
            freq[genre]+=1
        else:
            freq[genre]=1
    sum_all=sum(freq.values())
    freq_per={}
    for key in freq:
        val=freq[key]
        freq_per[key]=(val/sum_all)*100
    return freq_per

In [21]:
#Function to Transform Frequency Table to List of Tuples
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [22]:
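#Sorted Frequency Table for Apple
#Note: run on the full apple list, not the filtered free_iOS list
display_table(apple,11) #prints the table; returns None, so there is nothing to assign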

Games : 53.66124774211477
Entertainment : 7.433652910935113
Education : 6.294289287203002
Photo & Video : 4.849242740030569
Utilities : 3.4458802278727245
Health & Fitness : 2.501042100875365
Productivity : 2.473252744198972
Social Networking : 2.3204112824788106
Lifestyle : 2.0008336807002918
Music : 1.9174656106711132
Shopping : 1.6951507572599693
Sports : 1.5839933305543976
Book : 1.5562039738780047
Finance : 1.445046547172433
Travel : 1.1254689453939142
News : 1.0421008753647354
Weather : 1.0004168403501459
Reference : 0.8892594136445742
Food & Drink : 0.8753647353063776
Business : 0.7919966652771988
Navigation : 0.6391552035570377
Medical : 0.31957760177851885
Catalogs : 0.1389467833819647


In [23]:
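#Sorted Frequency Table for Google
#Note: run on the full google list, not the filtered free_android list
display_table(google,1) #prints the table; returns None, so there is nothing to assign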

FAMILY : 18.19188191881919
GAME : 10.55350553505535
TOOLS : 7.776752767527675
MEDICAL : 4.271217712177122
BUSINESS : 4.243542435424354
PRODUCTIVITY : 3.911439114391144
PERSONALIZATION : 3.616236162361624
COMMUNICATION : 3.5701107011070112
SPORTS : 3.5424354243542435
LIFESTYLE : 3.5239852398523985
FINANCE : 3.3763837638376386
HEALTH_AND_FITNESS : 3.1457564575645756
PHOTOGRAPHY : 3.0904059040590406
SOCIAL : 2.7214022140221403
NEWS_AND_MAGAZINES : 2.61070110701107
SHOPPING : 2.3985239852398523
TRAVEL_AND_LOCAL : 2.3800738007380073
DATING : 2.158671586715867
BOOKS_AND_REFERENCE : 2.1309963099630997
VIDEO_PLAYERS : 1.6143911439114391
EDUCATION : 1.4391143911439115
ENTERTAINMENT : 1.3745387453874538
MAPS_AND_NAVIGATION : 1.2638376383763839
FOOD_AND_DRINK : 1.1715867158671587
HOUSE_AND_HOME : 0.8118081180811807
LIBRARIES_AND_DEMO : 0.7841328413284132
AUTO_AND_VEHICLES : 0.7841328413284132
WEATHER : 0.7564575645756457
ART_AND_DESIGN : 0.5996309963099631
EVENTS : 0.5904059040590406
PARENTING : 

## Observation of Genres for Apple and Google
 - In the App Store over 50% of apps are games, while games make up only about 10% of Google Play
 - It would be interesting to see how each store defines its categories
 - App Store top three: Games, Entertainment and Education
 - Google Play top three: Family, Game and Tools
 - The two stores show very different patterns, with no common ranking of genres
 - A frequency table alone is not enough to build an app profile: it shows which genres have the most apps, not how popular the apps in each genre are
 
**Now to generate data on the number of users for each app genre**
 - For the App Store we'll use information on the total number of user ratings (`rating_count_tot` column)
 - For Google Play we'll use the number of installs (`Installs` column)

In [24]:
#Average Number of User Ratings per App Genre on the App Store
#Note: computed over the full apple list, not the filtered free_iOS list
apple_freq=freq_table(apple,11)
for genre in apple_freq:
    total=0
    len_genre=0
    for row in apple:
        genre_app=row[11]
        if genre_app == genre:
            num_user_rat=float(row[5])
            total+=num_user_rat
            len_genre+=1
    avg_num=total/len_genre
    print(genre,':',avg_num)

Weather : 22181.027777777777
Games : 13691.996633868463
Social Networking : 45498.89820359281
News : 13015.066666666668
Business : 4788.087719298245
Navigation : 11853.95652173913
Health & Fitness : 9913.172222222222
Medical : 592.7826086956521
Book : 5125.4375
Food & Drink : 13938.619047619048
Sports : 14026.929824561403
Shopping : 18615.32786885246
Travel : 14129.444444444445
Catalogs : 1732.5
Music : 28842.021739130436
Utilities : 6863.822580645161
Productivity : 8051.3258426966295
Finance : 11047.653846153846
Reference : 22410.84375
Entertainment : 7533.678504672897
Lifestyle : 6161.763888888889
Photo & Video : 14352.280802292264
Education : 2239.2295805739514


# Top 5 Most Rated App Genres in the App Store
1) Social Networking
2) Music
3) Reference
4) Weather
5) Shopping

From experience, and from looking at the most rated genres, there appear to be relatively few social networking apps in the App Store, but their success rate (defined by the number of users) seems high. Another parameter worth examining is how long users stick with an app. There may be a lot of games because users are constantly looking for new games to play. After evaluating these parameters for both the App Store and Google Play, I would pick an app that is optimal for both markets.

In [25]:
#Percentage of Install Brackets in Google Play
display_table(google,5)

1,000,000+ : 14.566420664206642
10,000,000+ : 11.549815498154983
100,000+ : 10.784132841328413
10,000+ : 9.723247232472325
1,000+ : 8.367158671586717
5,000,000+ : 6.937269372693727
100+ : 6.632841328413284
500,000+ : 4.972324723247232
50,000+ : 4.418819188191882
5,000+ : 4.400369003690037
100,000,000+ : 3.7730627306273066
10+ : 3.560885608856089
500+ : 3.044280442804428
50,000,000+ : 2.6660516605166054
50+ : 1.8911439114391144
5+ : 0.7564575645756457
500,000,000+ : 0.6642066420664207
1+ : 0.6180811808118082
1,000,000,000+ : 0.5350553505535056
0+ : 0.12915129151291513
0 : 0.00922509225092251


In [26]:
#Average Number of Installs per Genre on Google Play
#Note: computed over the full google list, not the filtered free_android list
google_freq=freq_table(google,1)
for category in google_freq:
    total=0
    len_category=0
    for row in google:
        category_app=row[1]
        if category_app == category:
            n_installs=row[5]
            n_installs=n_installs.replace('+','')
            n_installs=n_installs.replace(',','')
            n_installs=float(n_installs)
            total+=n_installs
            len_category+=1
    avg_install=total/len_category
    print(category,':',avg_install)

NEWS_AND_MAGAZINES : 26488755.335689045
WEATHER : 5196347.804878049
BEAUTY : 513151.88679245283
ENTERTAINMENT : 19256107.382550336
FINANCE : 2395215.120218579
EVENTS : 249580.640625
DATING : 1129533.3632478632
TRAVEL_AND_LOCAL : 26623593.58914729
PARENTING : 525351.8333333334
MAPS_AND_NAVIGATION : 5286729.124087592
HOUSE_AND_HOME : 1917187.0568181819
MEDICAL : 115026.86177105832
EDUCATION : 5586230.769230769
BUSINESS : 2178075.7934782607
LIBRARIES_AND_DEMO : 741128.3529411765
GAME : 30669601.761363637
COMICS : 934769.1666666666
VIDEO_PLAYERS : 35554301.25714286
SHOPPING : 12491726.096153846
HEALTH_AND_FITNESS : 4642441.3841642225
LIFESTYLE : 1407443.8193717278
PHOTOGRAPHY : 30114172.10447761
SPORTS : 4560350.255208333
FOOD_AND_DRINK : 2156683.0787401577
COMMUNICATION : 84359886.95348836
BOOKS_AND_REFERENCE : 8318050.112554112
ART_AND_DESIGN : 1912893.8461538462
PRODUCTIVITY : 33434177.75707547
AUTO_AND_VEHICLES : 625061.305882353
SOCIAL : 47694467.46440678
TOOLS : 13585731.809015421
PE