# Mobile App Downloads Analysis

Using only Python built-in data structures.
- To analyze the data and understand what types of apps are likely to attract more users.
- To help developers decide what kind of apps they should work on.

In [2]:
import csv
# Read each file inside a with-block so the file handle is closed after loading
with open('data//AppleStore.csv',encoding="utf8") as AppleDataObj:
    AppleData=list(csv.reader(AppleDataObj))
with open('data//googleplaystore.csv',encoding="utf8") as GoogleDataObj:
    GoogleData=list(csv.reader(GoogleDataObj))

def explore_data(dataset,start,end,rows_and_columns=False):
    datasetSlice=dataset[start:end]
    for row in datasetSlice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows: ' +str(len(dataset)))
        print('Number of columns: '+str(len(dataset[0])))
explore_data(AppleData,1,3,rows_and_columns=True)
explore_data(GoogleData,1,3,rows_and_columns=True)


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


The code above reads two CSV files, 'AppleStore.csv' and 'googleplaystore.csv', and prints a few rows of each dataset to show what the data looks like. Rows 1 and 2 are printed because row 0 holds the column names.
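
As a minimal sketch of the same loading step, the snippet below separates the header row from the data rows. It reads from an in-memory string instead of the real files, so the sample content and the `load_rows` helper are illustrative assumptions and not part of the notebook.

```python
import csv
import io

# Hypothetical two-row sample standing in for a real CSV file.
SAMPLE = "id,track_name,price\n1,Example App,0.0\n2,Other App,1.99\n"

def load_rows(file_obj):
    """Return (header, rows) read from an open CSV file object."""
    rows = list(csv.reader(file_obj))
    return rows[0], rows[1:]

header, apps = load_rows(io.StringIO(SAMPLE))
print(header)     # column names
print(len(apps))  # number of data rows, header excluded
```

Keeping the header apart from the data would avoid the `[1:]` slices that the later cells need.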

In [29]:
print(AppleData[0])
print('\n')
print(GoogleData[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


For 'AppleData' we can use the following columns:

|Column name|Description|
|---|---|
|'track_name'|Name of the app|
|'prime_genre'|Primary genre (category) of the app|
|'user_rating'|Average user rating|
|'rating_count_tot'|Total number of user ratings|
|'price'|Price of the app|

A detailed description is available here: [Link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

For 'GoogleData' we can use the following columns:

|Column name|Description|
|---|---|
|'App'|Name of the app|
|'Category'|Category of the app|
|'Rating'|Average user rating|
|'Reviews'|Number of user reviews|
|'Installs'|Number of installs, given as a bucket such as '10,000+'|
|'Type'|Paid or free|

A detailed description is available here: [Link](https://www.kaggle.com/lava18/google-play-store-apps)

In [30]:
print(GoogleData[10473])
del GoogleData[10473]
del GoogleData[9149]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


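Above we deleted two broken rows with missing values, which were reported in the discussion section on 'kaggle.com'. The printed row, for example, has only 12 values against 13 column names because its 'Category' value is missing.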
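
Rows like the one printed above can also be located by comparing each row's length with the header's length. The sketch below is one possible check, run on a made-up miniature dataset. `find_malformed_rows` is a hypothetical helper, and it only catches rows with the wrong number of values, not other kinds of bad data.

```python
def find_malformed_rows(dataset):
    """Return indexes of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Hypothetical miniature dataset: the third row is missing its category.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]

bad = find_malformed_rows(sample)
print(bad)

# Delete from the end so that the earlier indexes stay valid.
for i in reversed(bad):
    del sample[i]
```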

We also discovered that some apps appear more than once, as shown below:

In [31]:
duplicateNames=[]
uniqueNames=[]
for app in GoogleData:
    name=app[0]
    if name in uniqueNames:
        duplicateNames.append(name)
    else:
        uniqueNames.append(name)
print('Number of duplicates: '+str(len(duplicateNames)))
print('Example of duplicate app: '+duplicateNames[0])


Number of duplicates: 1181
Example of duplicate app: Quick PDF Scanner + OCR FREE


As we can see above, there are 1181 duplicated rows, and we printed the name of one duplicated app.
Now we can print all rows with that app name:

In [32]:
for app in GoogleData:
    name=app[0]
    if name=='Quick PDF Scanner + OCR FREE':
        print(app)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


We should keep only one record for each app name. In this case there are 3 rows for the same app, and the only difference is the fourth column, the number of reviews. Rather than deleting 2 of them at random, let's follow a criterion: keep the most recent record, which should be the one with the highest number of reviews. Here the bottom row has '80804' reviews and the other two have '80805' each, which probably means that the number of reviews did not change between the last two data acquisitions. So we keep only one of those two, say the one with the lower index.

Let's now create a dictionary in which each app name appears only once:

In [33]:
reviews_max={}
for app in GoogleData[1:]:
    name=app[0]
    n_reviews=float(app[3])
    if name in reviews_max.keys() and reviews_max[name]<n_reviews:
        reviews_max[name]=n_reviews
    reviews_max.setdefault(name, n_reviews)
print(len(reviews_max))

9658


Here we created 'reviews_max', which maps each unique app name to its highest review count.
Let's now remove the remaining duplicate rows:

In [34]:
android_clean=[]
already_added=[]
for app in GoogleData[1:]:
    name=app[0]
    n_reviews=float(app[3])
    if n_reviews==reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)
        
        

In [35]:
print(len(android_clean))

9658


The code above builds the cleaned list with exactly one row for each app name present in the 'reviews_max' dictionary. To do so we looped through the GoogleData dataset and appended the matching rows to the 'android_clean' list. We also created another list called 'already_added' to skip an app name when a second row has the same review count, as in our previous example. The last print call checks that 'android_clean' and 'reviews_max' have the same length.
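
The two steps can also be folded into a single pass. The sketch below is an alternative for illustration, not the notebook's method: `dedupe_by_max_reviews` and the sample rows are made up, and the result is ordered by the first appearance of each name, which may differ from the order of 'android_clean'.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        # Strict '>' keeps the earlier row when review counts tie.
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Hypothetical rows shaped like the duplicate example: name, category, rating, reviews.
sample = [
    ['Quick PDF', 'BUSINESS', '4.2', '80805'],
    ['Quick PDF', 'BUSINESS', '4.2', '80805'],
    ['Quick PDF', 'BUSINESS', '4.2', '80804'],
    ['Other App', 'TOOLS', '4.0', '12'],
]
clean = dedupe_by_max_reviews(sample)
print(len(clean))
```

A dictionary lookup is constant time on average, while `name not in already_added` scans a list, so this version would likely scale better on larger datasets.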
After that we need to remove apps with non-English names, since we are interested only in the English-speaking market:

In [36]:
def English_AppName(app):
    portion=0
    for n in app:
        if ord(n)>127:
            portion+=1
    if portion >3:
        return False
    return True

This function takes an app name as a string and returns 'True' if the name is considered English, otherwise it returns 'False'. Standard English characters all fall in the ASCII range 0 to 127. Because some English names contain a few symbols outside that range (for example '™'), we set a limit: a name with more than 3 non-ASCII characters is treated as non-English, otherwise it is still treated as English. See below:

In [37]:
print(English_AppName('Docs To Go™ Free Office Suite'))

True


Let's now filter our two datasets:

In [38]:
GoogleData_clean=[]
for app in android_clean:
    # Keep English-named apps whose Type column is 'Free'
    if English_AppName(app[0]) and app[6]=='Free':
        GoogleData_clean.append(app)
AppleData_clean=[]
for app in AppleData[1:]:
    # Keep English-named apps whose price column is '0.0'
    if English_AppName(app[1]) and app[4]=='0.0':
        AppleData_clean.append(app)

Using this function we filtered out non-English apps from both datasets, appending only English titles to 'GoogleData_clean' and 'AppleData_clean'. In the same step we kept only free apps, because we assume that our revenue comes from in-app ads.

In [39]:
print(len(GoogleData_clean),len(AppleData_clean))

8863 3222


We are done with data cleaning. Our goal is to find out which kinds of apps attract users, ideally apps that are popular on both the App Store and Google Play. The next step is to find the most common genres in each market. For that we need to build and sort frequency tables:

In [40]:
def freq_table(dataset,index):
    Categories={}
    for app in dataset:
        if app[index] not in Categories:
            Categories[app[index]]=0
    for app in dataset:
        if app[index] in Categories:
            Categories[app[index]]+=1
    total_number_apps=len(dataset)
    for category in Categories:
        Categories[category]=round(Categories[category]*100/total_number_apps,2)
    return Categories    


In [41]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [42]:
display_table(GoogleData_clean,1)
print('\n')
display_table(AppleData_clean,11)


FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs 

With the last cells we grouped our datasets by the category column. First we defined a frequency table that gives each category's percentage share of all free English apps. Then the 'display_table' function sorted the result in descending order. The first list is for Google Play and the second is for the App Store.

In the App Store the most common genre among free English apps is Games, and the runner-up is Entertainment. We can also compare genres by purpose: practical or entertainment. To the first group we can assign e.g. education, shopping, utilities, productivity, lifestyle, and to the second e.g. games, photo and video, social networking, music. Clearly many more App Store apps belong to the entertainment group. This might suggest that Games is the most popular genre among users, but a large number of apps does not by itself prove a large number of users. Let's look for more evidence.
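
For reference, the same percentage table can be built in a single pass over the data. This is a minimal sketch on invented rows, not a replacement for the cells above. `percent_table` is a hypothetical name.

```python
def percent_table(dataset, index):
    """Map each value in column `index` to its share of all rows, in percent."""
    counts = {}
    for row in dataset:
        counts[row[index]] = counts.get(row[index], 0) + 1
    total = len(dataset)
    return {key: round(n * 100 / total, 2) for key, n in counts.items()}

# Hypothetical rows: (name, genre)
sample = [('A', 'Games'), ('B', 'Games'), ('C', 'Games'), ('D', 'Weather')]
table = percent_table(sample, 1)

# Sort by share, largest first.
for genre, share in sorted(table.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', share)
```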

In [43]:
CategoriesData=freq_table(AppleData_clean,11)
Users_per_genre=[]
for genre in CategoriesData.keys():
    total=0
    len_genre=0
    for app in AppleData_clean:
        genre_app=app[11]
        if genre_app==genre:
            users_rated=float(app[5])
            total+=users_rated
            len_genre+=1
    rating_avg=total/len_genre
    Users_per_genre.append((round(rating_avg),genre))

In [44]:
def Table_sort(table_tuple):
    table_sorted = sorted(table_tuple, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
Table_sort(Users_per_genre)

Navigation : 86090
Reference : 74942
Social Networking : 71548
Music : 57327
Weather : 52280
Book : 39758
Food & Drink : 33334
Finance : 31468
Photo & Video : 28442
Travel : 28244
Shopping : 26920
Health & Fitness : 23298
Sports : 23009
Games : 22789
News : 21248
Productivity : 21028
Utilities : 18684
Lifestyle : 16486
Entertainment : 14030
Business : 7491
Education : 7004
Catalogs : 4004
Medical : 612


Above we calculated the average number of user ratings for each genre. A better measure, the number of installs, is not available in the App Store dataset, so we had to use a substitute. We took the total number of ratings left by users ('rating_count_tot') and averaged it per genre. This time the top genre is Navigation, and Reference and Social Networking are also popular. We could split the genres into 4 groups: above 70000, with the 3 top genres; 30001-70000, with 5 genres; 10000-30000, with 11 genres; and below 10000, with 4 genres. That is probably the best measure we can take into account when planning which group of apps to develop, and it points to Navigation, Reference and Social Networking.

If we compare this with the share of apps per genre that we got previously, we find that very few apps are dedicated to Navigation (0.19%) while demand appears to be high. We also need to remember that we limited our data to English-language, free apps. Another factor worth mentioning is that a genre may be dominated by a few apps from giant companies such as Google. Let's print some of our data to see what is in these categories.
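
The per-genre average can also be computed in one pass, and a tiny example shows why a mean can mislead when one app dominates a genre. The sketch below uses invented numbers. `average_by_group` is a hypothetical helper, not part of the notebook.

```python
def average_by_group(dataset, group_idx, value_idx):
    """Return {group: mean of column value_idx}, computed in one pass."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_idx]
        totals[group] = totals.get(group, 0.0) + float(row[value_idx])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical rows: name, genre, rating count.
sample = [
    ['Big Maps', 'Navigation', '300000'],
    ['Tiny Maps', 'Navigation', '100'],
    ['Chat', 'Social Networking', '5000'],
]
averages = average_by_group(sample, 1, 2)
print(averages)
```

In this made-up sample the Navigation average is driven almost entirely by one app, which is the kind of skew worth checking for in the real data.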

In [63]:
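def Apple_TopCategory(category):
    Apple_topApps=[]
    seen_names=set()
    for app in AppleData_clean:
        # Compare names with names: the earlier test checked a name
        # against a list of rows, so it never filtered anything
        if app[11]==category and app[1] not in seen_names:
            Apple_topApps.append(app)
            seen_names.add(app[1])

    for n in Apple_topApps:
        print(n[1],n[5])
        
Apple_TopCategory('Navigation')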

Waze - GPS Navigation, Maps & Real-time Traffic 345046
Google Maps - Navigation & Transit 154911
Geocaching® 12811
CoPilot GPS – Car Navigation & Offline Maps 3582
ImmobilienScout24: Real Estate Search in Germany 187
Railway Route Search 5


As this quick look shows, the market is dominated by 2 apps, 'Waze' and 'Google Maps', so it could be hard to attract new users. Now let's calculate the average number of installs per category for the Google dataset.
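
The 'Installs' column stores bucketed strings such as '10,000+', so they have to be converted to numbers before averaging. A minimal sketch of that conversion follows. `parse_installs` is a hypothetical helper, and the result is only a lower bound because the dataset does not give exact install counts.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to an int (its lower bound)."""
    return int(text.replace(',', '').replace('+', ''))

values = [parse_installs(s) for s in ['10,000+', '500,000+', '1,000,000,000+', '0']]
print(values)
```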

In [64]:
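# Strip '+' and ',' from the Installs column so the values can be converted to float.
# The rows in GoogleData_clean are the same list objects, so they are updated too.
for n in range(len(GoogleData)):
    n_installs=GoogleData[n][5].replace('+','').replace(',','')
    GoogleData[n][5]=n_installs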
CategoriesData=freq_table(GoogleData_clean,1)
Users_per_genre=[]
for genre in CategoriesData.keys():
    total=0
    len_genre=0
    for app in GoogleData_clean:
        genre_app=app[1]
        if genre_app==genre:
            users_rated=float(app[5])
            total+=users_rated
            len_genre+=1
    rating_avg=total/len_genre
    Users_per_genre.append((round(rating_avg),genre))

def Table_sort(table_tuple):
    table_sorted = sorted(table_tuple, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
Table_sort(Users_per_genre)

COMMUNICATION : 38456119
VIDEO_PLAYERS : 24727872
SOCIAL : 23253652
PHOTOGRAPHY : 17840110
PRODUCTIVITY : 16787331
GAME : 15588016
TRAVEL_AND_LOCAL : 13984078
ENTERTAINMENT : 11640706
TOOLS : 10801391
NEWS_AND_MAGAZINES : 9549178
BOOKS_AND_REFERENCE : 8767812
SHOPPING : 7036877
PERSONALIZATION : 5201483
WEATHER : 5074486
HEALTH_AND_FITNESS : 4188822
MAPS_AND_NAVIGATION : 4056942
FAMILY : 3697848
SPORTS : 3638640
ART_AND_DESIGN : 1986335
FOOD_AND_DRINK : 1924898
EDUCATION : 1833495
BUSINESS : 1712290
LIFESTYLE : 1437816
FINANCE : 1387692
HOUSE_AND_HOME : 1331541
DATING : 854029
COMICS : 817657
AUTO_AND_VEHICLES : 647318
LIBRARIES_AND_DEMO : 638504
PARENTING : 542604
BEAUTY : 513152
EVENTS : 253542
MEDICAL : 120551


As we can see, the most popular categories are 'COMMUNICATION', 'VIDEO_PLAYERS' and 'SOCIAL'. Let's now see what is in the top one:

In [65]:
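def Google_TopCategory(category):
    Google_topApps=[]
    seen_names=set()
    for app in GoogleData_clean:
        # Compare names with names: the earlier test checked a name
        # against a list of rows, so it never filtered anything
        if app[1]==category and app[0] not in seen_names:
            Google_topApps.append(app)
            seen_names.add(app[0])

    for n in Google_topApps:
        print(n[0],n[5])
        
Google_TopCategory('COMMUNICATION')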

WhatsApp Messenger 1000000000
Messenger for SMS 10000000
My Tele2 5000000
imo beta free calls and text 100000000
Contacts 50000000
Call Free – Free Call 5000000
Web Browser & Explorer 5000000
Browser 4G 10000000
MegaFon Dashboard 10000000
ZenUI Dialer & Contacts 10000000
Cricket Visual Voicemail 10000000
TracFone My Account 1000000
Xperia Link™ 10000000
TouchPal Keyboard - Fun Emoji & Android Keyboard 10000000
Skype Lite - Free Video Call & Chat 5000000
My magenta 1000000
Android Messages 100000000
Google Duo - High Quality Video Calls 500000000
Seznam.cz 1000000
Antillean Gold Telegram (original version) 100000
AT&T Visual Voicemail 10000000
GMX Mail 10000000
Omlet Chat 10000000
My Vodacom SA 5000000
Microsoft Edge 5000000
Messenger – Text and Video Chat for Free 1000000000
imo free video calls and chat 500000000
Calls & Text by Mo+ 5000000
free video calls and chat 50000000
Skype - free IM & video calls 1000000000
Who 100000000
GO SMS Pro - Messenger, Free Themes, Emoji 100000000
Mes

Here we can see that for Android apps the market is less concentrated. There are some leaders in the category, but there is much more competition compared to the Apple market. Analyzing the distribution of downloads within each category could help us choose the best target category.
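
One possible way to look at that distribution is the share of a category's installs held by its largest apps. The sketch below is an illustration only: `top_share` is a hypothetical helper and the install counts are invented, not taken from the dataset.

```python
def top_share(installs, n=3):
    """Return the percentage of all installs held by the n largest apps."""
    ordered = sorted(installs, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return round(sum(ordered[:n]) * 100 / total, 2)

# Hypothetical install counts for two made-up categories.
concentrated = [1000000000, 500000000, 1000000, 100000, 1000]
even = [5000000, 5000000, 5000000, 5000000, 5000000]

print(top_share(concentrated, 2))  # close to 100: two apps hold almost everything
print(top_share(even, 2))          # 40.0: installs are spread evenly
```

A category where the top few apps hold nearly all installs would probably be harder to enter than one with a flatter distribution.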