# Analyzing Mobile App Data
For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in the App Store.

#### Data set description: AppleStore.csv
| Column Name | Description |
| :-----: | :----: |
| "id" | App ID |
| "track_name" | App name |
| "size_bytes" | Size (in bytes) |
| "currency" | Currency type |
| "price" | Price amount |
| "rating_count_tot" | User rating count (all versions) |
| "rating_count_ver" | User rating count (current version) |
| "user_rating" | Average user rating (all versions) |
| "user_rating_ver" | Average user rating (current version) |
| "ver" | Latest version code |
| "cont_rating" | Content rating |
| "prime_genre" | Primary genre |
| "sup_devices.num" | Number of supported devices |
| "ipadSc_urls.num" | Number of screenshots shown for display |
| "lang.num" | Number of supported languages |
| "vpp_lic" | VPP device-based licensing enabled |

The analysis uses two data sets:

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).


In [1]:
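def open_dataset(file_name,header=True):
    # Read a CSV file into a list of rows; when header is True, the header row is dropped.
    from csv import reader
    with open(file_name,encoding="utf8") as opened:    # the file is closed automatically
        rows=list(reader(opened))
    if header:
        return rows[1:]
    return rows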


In [2]:
app_store=open_dataset("AppleStore.csv",False)
print(app_store[:5])
play_store= open_dataset("googleplaystore.csv",False)
print(play_store[:5])

[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']]
[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_A

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
explore_data(app_store,0,3,True)
explore_data(play_store,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone

In [5]:
print(app_store[0],"\n\n\n\n\n\n",play_store[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 





 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [6]:
# app_store[0]
print(play_store[0],"\n\n\n\n")
print(play_store[10473])

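# Row 10473 is missing its category, so the later values are shifted one column to the left
# and the 'Rating' column reads 19, which is above the maximum of 5. We delete this row.
print(len(play_store))
del play_store[10473]   # run this cell only once; running it again would delete a valid row
print(len(play_store))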
print(len(play_store))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 




['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10842
10841


In [7]:
print(play_store[10473])       # a different row now sits at index 10473

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
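
The malformed row printed earlier has 12 values while the header has 13, so a length check is one way to find such rows without knowing their index in advance. A minimal sketch on made-up data (`find_bad_rows` and `sample` are hypothetical names, not part of the notebook):

```python
def find_bad_rows(dataset):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# made-up sample: the last row is missing its 'Category' value
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
]
print(find_bad_rows(sample))
```

Run against the real data set, the same check would presumably flag index 10473 before the deletion, but that is not verified here.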


In [8]:
# duplicate_apps=list()
# unique_apps=list()
# c=0
# td= dict()
# def dupli_uni(data_set,name_indx):     # assuming that data set will have headers..
#     duplicate_apps=list()
#     unique_apps=list()
#     c=0
#     td= dict()
#     for row in data_set[1:]:
#         name=row[name_indx]
#         td[name]=td.get(name,0)+1
        
#     for k in td.keys():
# #         c+=1
# #     return c
#         if(td[k]>1):c+=td[k]
#     return c

#     for key in td.keys():
#         if(td[key]==1):
#             unique_apps.append(key)
#         else: 
#             duplicate_apps.append(key)

#     return duplicate_apps

# Self Cleaning

In [9]:

def dupli_uni(data_set,name_indx):     # assumes the data set has a header row
    # Temporary variables are declared inside the function so that
    # every call starts from empty lists.
    duplicate_apps=list()
    unique_apps=list()
    for row in data_set[1:]:
        name=row[name_indx]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    return duplicate_apps

# print(len(dupli_uni(play_store,0)))     #this gives no of entrees to remove.. 
# print(dupli_uni(play_store,0))
# print(play_store[2][0])

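def max_ind(data_set,lst,rev_ind):
    # Return the index from lst whose row has the largest review count.
    mxi=lst[0]
    for i in lst:
        # compare as integers; as strings, '9' would rank above '10'
        if int(data_set[i][rev_ind]) > int(data_set[mxi][rev_ind]):
            mxi=i
    return mxi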

In [10]:
# Removing duplicates from the Play Store data, keeping the entry with the most reviews (index = 3).

# Iterate over the data set to build a map from each app name to the row indexes where it appears.
nam_inds=dict()
indx_to_rmv=list()
c=0
for row in play_store[1:]:
    c+=1
    name=row[0]
    if name in nam_inds:
        nam_inds[name].append(c)
    else:
         nam_inds[name]=list()
         nam_inds[name].append(c)
li=list(nam_inds.keys())
for k in li:
    if(len(nam_inds[k])>1): continue
    else: nam_inds.pop(k)

# max_ind returns the index with the maximum number of reviews; we keep that row
# and add the remaining indexes to the list of indexes to remove.
for k in nam_inds:
    nam_inds[k].remove(max_ind(play_store,nam_inds[k],3))
    indx_to_rmv+=nam_inds[k]
    




# li=list(nam_inds.keys())
print(len(indx_to_rmv))         
        






1181


In [11]:
cleaned_play=list()
# cleaned_play.append(play_store[0])  # adding headers...

# copying data which is not in indx_to_rmv list.
c=0
for row in play_store:
    if(c not in indx_to_rmv): cleaned_play.append(row)
    c+=1
    
print(len(cleaned_play))     

9660


This length is 9660 because the header row is also included.
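
As a quick sanity check, the row counts printed so far are consistent with each other. The numbers below are copied from the cell outputs above; the variable names are only illustrative:

```python
rows_with_header = 10841   # len(play_store) after deleting the malformed row
duplicates = 1181          # len(indx_to_rmv)

remaining = rows_with_header - duplicates
print(remaining)        # rows left, header included
print(remaining - 1)    # apps left, header excluded
```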

# Cleaning data as instructed

In [12]:
print(len(dupli_uni(play_store,0)))

1181


In [13]:
# Step 1: find every unique app name and its maximum number of reviews.

def name_revs(data_set,rev):          # takes a data set with duplicates and returns a dict {app name: maximum reviews}
    na_rv=dict()
    for row in data_set[1:]:
        nam=row[0]
        na_rv.setdefault(nam,0)
        if (na_rv[nam]<int(row[rev])): na_rv[nam]=int(row[rev])
    return na_rv   
            
    

Use the dictionary you created above to remove the duplicate rows:

- Start by creating two empty lists: android_clean (which will store our new cleaned data set) and already_added (which will just store app names).
- Loop through the Google Play dataset (don't include the header row), and for each iteration, do the following:
    - Assign the app name to a variable named name.
    - Convert the number of reviews to float, and assign it to a variable named n_reviews.
    - If n_reviews is the same as the number of maximum reviews of the app name (the number can be found in the reviews_max dictionary) and name is not already in the list already_added (read the solution notebook to find out why we need this supplementary condition):
        - Append the entire row to the android_clean list (which will eventually be a list of lists and store our cleaned dataset).
        - Append the app name to the already_added list; this helps us to keep track of apps that we already added.
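
The supplementary `already_added` condition matters when duplicate rows tie on the maximum number of reviews. A minimal sketch on made-up rows (the `rows` data and the `dedupe` helper are hypothetical, and only the name and review columns are included):

```python
def dedupe(rows, reviews_max, track_added=True):
    """Keep the first row per app name that reaches the maximum review count."""
    clean = []
    already_added = []
    for name, reviews in rows:
        n_reviews = float(reviews)
        if n_reviews == reviews_max[name]:
            if track_added and name in already_added:
                continue  # a duplicate with the same count was already kept
            clean.append([name, reviews])
            already_added.append(name)
    return clean

# made-up rows: 'App A' appears twice with the same (maximum) review count
rows = [['App A', '100'], ['App A', '100'], ['App A', '90'], ['App B', '5']]
reviews_max = {'App A': 100.0, 'App B': 5.0}

print(len(dedupe(rows, reviews_max)))                     # with the name check
print(len(dedupe(rows, reviews_max, track_added=False)))  # without it
```

Without the name check, both tied rows for 'App A' pass the review comparison, so the app ends up in the cleaned list twice.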

In [14]:
max_reviews=name_revs(play_store,3)

# Step 2
android_clean=list()
android_clean.append(play_store[0])    # adding headers...

already_added=list()   # needed because we only compare review counts: if duplicate rows of an app
                                   # share the same maximum count, we would otherwise add the app twice
for row in play_store[1:]:
    name=row[0]
    if(max_reviews[name]==float(row[3]) ) and (name not in already_added ): 
        android_clean.append(row)
        already_added.append(name)

        
# here we have the cleaned data set: 9659 apps + 1 header row
print(len(android_clean))


9660


In [15]:
explore_data(android_clean,0,4,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9660
Number of columns: 13


In [16]:
for let in android_clean[4413][0]:
    print(ord(let),"\n")

20013 

22269 

35486 

32 

65 

81 

12522 

12473 

12491 

12531 

12464 



To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

In [17]:
def eng_check(stri):    # returns True unless the name has more than three non-ASCII characters
    c=0
    for ch in stri:
        if ord(ch)>127:
            c+=1
            if c>3: 
                return False
    return True


In [18]:
st='Docs To Go™ Free Office Suite Instachat 😜🎞'
eng_check(st)

True
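
The threshold can be checked by counting the characters whose code point is above 127. A small sketch, where `count_non_ascii` is a hypothetical helper; the first two names use escape sequences so the exact characters are unambiguous, and the third is rebuilt from the code points printed in cell 16:

```python
def count_non_ascii(text):
    """Count characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for ch in text if ord(ch) > 127)

name_a = 'Docs To Go\u2122 Free Office Suite'   # one trademark sign
name_b = 'Instachat \U0001F61C'                 # one emoji
# code points taken from the ord() output of cell 16
name_c = ''.join(chr(n) for n in (20013, 22269, 35486, 32, 65, 81,
                                  12522, 12473, 12491, 12531, 12464))

for name in (name_a, name_b, name_c):
    n = count_non_ascii(name)
    print(n, 'kept' if n <= 3 else 'removed')
```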

In [19]:
android_cleaned2=list()
android_cleaned2.append(android_clean[0])
for row in android_clean[1:]:
    if(eng_check(row[0])): android_cleaned2.append(row)
    

explore_data(android_cleaned2,0,3,True)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9615
Number of columns: 13


In [20]:
ios_cleaned1=list()
ios_cleaned1.append(app_store[0])
for row in app_store[1:]:
    if(eng_check(row[1])): ios_cleaned1.append(row)

explore_data(ios_cleaned1,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 6184
Number of columns: 16


# Isolating free apps

In [21]:
free_andro=list()
free_andro.append(android_cleaned2[0])
for row in android_cleaned2[1:]:
    price=row[7] 
    price=price.lstrip(" $")
    if(float(price)==0): free_andro.append(row)

free_ios=list()
free_ios.append(ios_cleaned1[0])
for row in ios_cleaned1[1:]:
    price=row[4] 
    price=price.lstrip(" $")
    if(float(price)==0): free_ios.append(row)
        
        
explore_data(free_andro,0,3,True)
explore_data(free_ios,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 8865
Number of columns: 13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

In [22]:
# Making frequency tables: Mahesh Pandey
def  freq_table(data_set,ind):
    temp=dict()
    l=0
    for row in data_set[1:]:
        val=row[ind]
        temp[val]=temp.get(val,0)+1
        l+=1
    for k in temp:
        temp[k]/=(l*0.01)    # convert each count to a percentage of all rows
    return temp
     


In [23]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [24]:
display_table(free_andro,1) # category
display_table(free_andro,9) # genres
display_table(free_ios,11) # genre

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.700361010830325
MEDICAL : 3.5311371841155235
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.237815884476534
HEALTH_AND_FITNESS : 3.079873646209386
PHOTOGRAPHY : 2.9444945848375452
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768953
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418774
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075813
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 0

In [25]:
display_table(free_ios,11) # genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.6623215394165114
Social Networking : 3.289882060831782
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.048417132216015
Health & Fitness : 2.017380509000621
Productivity : 1.7380509000620734
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310367
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.43451272501551835
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157667


In [26]:
def ratings(data,ri,gi):
    # Print the average of column ri (here the number of user ratings)
    # for each distinct value of column gi (here the genre).
    nam_indx=dict()      # genre -> list of values from column ri
    avg_ratings=dict()
    for row in data[1:]:
        nam_indx.setdefault(row[gi],list()).append(int(row[ri]))

    for k in nam_indx:
        avg_ratings[k]=sum(nam_indx[k])/len(nam_indx[k])

    print(avg_ratings)
    
    
  
    

In [27]:
ratings(free_ios,5,11)

{'Social Networking': 71548.34905660378, 'Photo & Video': 28441.54375, 'Games': 22788.6696905016, 'Music': 57326.530303030304, 'Reference': 74942.11111111111, 'Health & Fitness': 23298.015384615384, 'Weather': 52279.892857142855, 'Utilities': 18684.456790123455, 'Travel': 28243.8, 'Shopping': 26919.690476190477, 'News': 21248.023255813954, 'Navigation': 86090.33333333333, 'Lifestyle': 16485.764705882353, 'Entertainment': 14029.830708661417, 'Food & Drink': 33333.92307692308, 'Sports': 23008.898550724636, 'Book': 39758.5, 'Finance': 31467.944444444445, 'Education': 7003.983050847458, 'Productivity': 21028.410714285714, 'Business': 7491.117647058823, 'Catalogs': 4004.0, 'Medical': 612.0}


In [28]:
def display_Sorted_table(table):
#     table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [29]:
display_Sorted_table(freq_table(free_andro,5))

1,000,000+ : 15.72653429602888
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.1985559566787
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.772111913357401
5,000+ : 4.512635379061372
10+ : 3.542418772563177
500+ : 3.2490974729241877
50,000,000+ : 2.3014440433212995
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.7897111913357401
1+ : 0.5076714801444043
500,000,000+ : 0.27075812274368233
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


In [30]:
# strips every non-digit character, e.g. "1,000,000+" becomes "1000000"
import re
def fun(st):
    p=re.compile(r'\D')    # raw string, so \D is not treated as a string escape
    return p.sub("",st)
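
A short usage sketch of the same idea, assuming install counts formatted like those in the frequency table above (`digits_only` is a hypothetical stand-in for `fun` that also converts the result to an integer):

```python
import re

NON_DIGIT = re.compile(r'\D')   # matches any character that is not a digit

def digits_only(text):
    """Remove every non-digit character and return the result as an int."""
    return int(NON_DIGIT.sub('', text))

for installs in ('1,000,000+', '500+', '0'):
    print(installs, '->', digits_only(installs))
```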

In [31]:
def avg_installs(data,ctgry_indx,instl_ind):
    uniq=freq_table(data,ctgry_indx)
    temp=dict()
    for k in uniq:
        lenn=0
        summ=0
        for row in data[1:]:
            if k==row[ctgry_indx] :
                st=row[instl_ind]
                summ+= int(fun(st))
                lenn+=1
        temp[k]=summ/lenn
    return temp

#         print(k,"   :   ",


In [32]:
display_Sorted_table(avg_installs(free_andro,1,5))

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

I would recommend a productivity or tools app.