# Profitable App Profiles for the App Store and Google Play Markets

### The goal of this project is to analyze data to find the type of apps that are likely to attract more users. This project was submitted as a guided project for Python for Data Science: fundamentals course at Dataquest.io.
<br> Data sets:[Mobile App Statistics (Apple iOS app store)](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home), [Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps/home)

In [1]:
from csv import reader
import re

## Reading Data sets

In [2]:
apple_data = list(reader(open('AppleStore.csv')))
google_data = list(reader(open('googleplaystore.csv')))

### This function is to print the provided data set

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
explore_data(google_data, 3,7,True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 10842
Number of columns: 13


### Number of rows in both data sets
<br> remember that first row contains the name of columns

In [5]:
print("No of rows in android apps data set : ", len(google_data) - 1)
print("No of rows in IOS apps data set : ", len(apple_data) - 1)

No of rows in android apps data set :  10841
No of rows in IOS apps data set :  7197


### Columns in android apps data set

In [6]:
google_data[0]

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

### Columns in IOS apps data set


In [7]:
apple_data[0]

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

## Data Cleaning

### Due to [error](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) reported in android apps data set we remove that row after confirming that the error exists

In [8]:
len(google_data[10473]) == len(google_data[0])

False

In [9]:
del google_data[10473]

notice the change in size of android apps data set after deleting a row

In [10]:
len(google_data) - 1

10840

### duplicate entries
Our data sets might contain some duplicate entries and we do not need to keep all of them. Lets see if there are any duplicate entries in android apps data set.

In [12]:
unique_apps = []
for row in google_data[1:]:
    if row[0] not in unique_apps:
        unique_apps.append(row[0])
print("Number of unique apps : ",len(unique_apps))
print("Number of total entries : ", len(google_data) - 1)

Number of unique apps :  9659
Number of total entries :  10840


### We can see that number of unique apps is not equal to total no of entries, hence this data set contains duplicate entries. Now lets check same for IOS apps.

In [13]:
unique_apps = []
for row in apple_data[1:]:
    if row[0] not in unique_apps:
        unique_apps.append(row[0])
print("Number of unique apps : ",len(unique_apps))
print("Number of total entries : ", len(apple_data) - 1)

Number of unique apps :  7197
Number of total entries :  7197


### IOS apps data set has no duplicate entries

## Removing duplicate entries from android apps data set
### Amon the duplicates, we will keep the entry which has the most number of reviews assuming that it is the most recent entry 



In [14]:
# returns the list of apps that has duplicate entries
def getDuplicateApps(dataset):
    unique_apps = []
    duplicate_apps = []
    for row in dataset[1:]:     
        if row[0] in unique_apps and row[0] not in duplicate_apps:
            duplicate_apps.append(row[0])
        elif row[0] not in unique_apps:
            unique_apps.append(row[0])

    return duplicate_apps

In [15]:
# finds the indexes of the entries for a certain app
def findIndex(dataset, app):
    index_list = []
    for index, row in enumerate(dataset):
        if row[0] == app:
            index_list.append(index)
    return index_list


In [16]:
# deletes entries at certain indexes
def deleteEntries(dataset, indexes):
    # finds the max value of reviews among all entires for an app
    max_reviews = max(int(dataset[ind][3]) for ind in indexes)  
    # all of our duplicate entries might have same no of reviews
    # so we need to make sure that we have kept one entry
    # and deleted others
    entry_found = False
    # after deletion of entry, the indexes of other entries are
    # decremented by one position, i is used to handle it
    i = 0
    for ind in indexes:
        if int(dataset[ind][3]) < max_reviews:
            #print("removing ",dataset[ind - i][0])
            del dataset[ind - i]
            #print("removed at index ",ind - i)
            i = i + 1
        elif int(dataset[ind - i][3]) == max_reviews and entry_found == False:
            entry_found = True
            #print("found ",dataset[ind - i][0])
        else:
            #print("removing ",dataset[ind - i][0])
            del dataset[ind - i]
            #print("removed at index ",ind - i)
            i = i + 1
    return dataset


In [17]:
# calling above methods to remove duplicate entries
duplicate_apps_names = getDuplicateApps(google_data)
for app in duplicate_apps_names:
    indexes = findIndex(google_data, app)    
    google_data = deleteEntries(google_data, indexes)



In [18]:
# number of entries in android apps data set after 
# removing duplicate entries
len(google_data) - 1

9405

### removing non english apps
since we are interested in analyzing only English apps so we must remove non english apps

ord(character) returns the unicode for that character.
All numbers and alphabets fall in the range of 0 - 127 (inclusive). Letter like symbols have unicode : 8448 - 8527
some english apps have emoticons or symbols in their titles and it is hard to put checks for such cases. so we are allowing up to three characters that fall outside the range of alphabhets and numbers, and is also not letter like symbol.
    

In [19]:
# returns True if the given string is all english
def is_english(app):
    count =0
    for ch in app:
        if ord(ch) not in range(0,128) and ord(ch) not in range(8448, 8528):
            count = count + 1
            if(count == 4):
                return False

    return True
    

In [20]:
# remove non english apps from given data set, app_name_index is the
# index of app title in given data set
def remove_nonEnglish(dataset, app_name_index):
    i = 0
    for index, row in enumerate(dataset[1:]):
        if is_english(row[app_name_index]) == False:
            #print("removing ", dataset[index + 1 - i][app_name_index])
            del dataset[index + 1 - i]
            i = i + 1 
    return dataset

In [21]:
# function calls to remove nonenglish apps from both data sets
google_data = remove_nonEnglish(google_data, 0)
apple_data = remove_nonEnglish(apple_data,1)


We also wants to analyze only free apps so we must remove apps that are not free

In [22]:
#  not free apps
# ind is the index for column that indicates whether an app is free or not
# free_rate is the value used for free apps in given data set
def remove_notFree_apps(dataset, ind, free_rate):
    # keeps count of deleted rows for index calculation
    i = 0
    for index,row in enumerate(dataset[1:]):
        if row[ind] != free_rate :
            del dataset[index+1-i]
            i = i + 1
    return dataset 
        


In [23]:
apple_data = remove_notFree_apps(apple_data,4, "0.0")
google_data = remove_notFree_apps(google_data, 7, "0")

Number of rows in both data sets after data cleaning

In [24]:
len(apple_data) - 1

3222

In [25]:
len(google_data) -1

8613

## identifying most common genres in both app markets

### Our goal is to find apps that are profitable so we would look at different genres to identify the most profitable genre in both data sets


#### For Android Apps

In [26]:
# an empty dictionary where key = genre and value = frequency
genre_dict = {}
for row in google_data[1:]:
    if row[1] in genre_dict:
        genre_dict[row[1]] += 1
    else:
        genre_dict[row[1]] = 1
        
print(genre_dict)

{'TOOLS': 746, 'VIDEO_PLAYERS': 157, 'LIFESTYLE': 344, 'NEWS_AND_MAGAZINES': 245, 'PARENTING': 58, 'HOUSE_AND_HOME': 70, 'COMICS': 55, 'LIBRARIES_AND_DEMO': 82, 'AUTO_AND_VEHICLES': 82, 'WEATHER': 70, 'PERSONALIZATION': 292, 'SHOPPING': 193, 'SPORTS': 295, 'ENTERTAINMENT': 81, 'SOCIAL': 222, 'GAME': 813, 'MEDICAL': 310, 'ART_AND_DESIGN': 57, 'FOOD_AND_DRINK': 102, 'BEAUTY': 53, 'MAPS_AND_NAVIGATION': 124, 'COMMUNICATION': 258, 'DATING': 148, 'BUSINESS': 403, 'BOOKS_AND_REFERENCE': 184, 'PRODUCTIVITY': 340, 'PHOTOGRAPHY': 250, 'TRAVEL_AND_LOCAL': 199, 'EVENTS': 63, 'FINANCE': 320, 'FAMILY': 1632, 'HEALTH_AND_FITNESS': 264, 'EDUCATION': 101}


This function sorts a dictionary and return it as a list

In [37]:
def display_dict(dict_):
    return sorted(dict_.items(), key=lambda kv: kv[1])


#### sorted frequencies

In [38]:
display_dict(genre_dict)

[('BEAUTY', 53),
 ('COMICS', 55),
 ('ART_AND_DESIGN', 57),
 ('PARENTING', 58),
 ('EVENTS', 63),
 ('HOUSE_AND_HOME', 70),
 ('WEATHER', 70),
 ('ENTERTAINMENT', 81),
 ('LIBRARIES_AND_DEMO', 82),
 ('AUTO_AND_VEHICLES', 82),
 ('EDUCATION', 101),
 ('FOOD_AND_DRINK', 102),
 ('MAPS_AND_NAVIGATION', 124),
 ('DATING', 148),
 ('VIDEO_PLAYERS', 157),
 ('BOOKS_AND_REFERENCE', 184),
 ('SHOPPING', 193),
 ('TRAVEL_AND_LOCAL', 199),
 ('SOCIAL', 222),
 ('NEWS_AND_MAGAZINES', 245),
 ('PHOTOGRAPHY', 250),
 ('COMMUNICATION', 258),
 ('HEALTH_AND_FITNESS', 264),
 ('PERSONALIZATION', 292),
 ('SPORTS', 295),
 ('MEDICAL', 310),
 ('FINANCE', 320),
 ('PRODUCTIVITY', 340),
 ('LIFESTYLE', 344),
 ('BUSINESS', 403),
 ('TOOLS', 746),
 ('GAME', 813),
 ('FAMILY', 1632)]

### This function returns the percentage proportion for each genre in given data set
index indicates the index of column for genre

In [29]:
def freq_table(dataset, index):
    freq_dict = {}
    for row in dataset[1:]:
        if row[index] in freq_dict:
            freq_dict[row[index]] += 1
        else:
            freq_dict[row[index]] = 1
    freq_dict={k:(v/len(dataset) * 100) for k, v in freq_dict.items()}

    return freq_dict

In [30]:
apple_genre_freq = freq_table(apple_data, -5)
sorted(apple_genre_freq.items(), key=lambda kv: kv[1])

[('Catalogs', 0.12410797393732546),
 ('Navigation', 0.18616196090598822),
 ('Medical', 0.18616196090598822),
 ('Book', 0.43437790878063914),
 ('Business', 0.5274588892336333),
 ('Reference', 0.5584858827179646),
 ('Food & Drink', 0.8067018305926157),
 ('Weather', 0.8687558175612783),
 ('Finance', 1.1169717654359292),
 ('Travel', 1.2410797393732547),
 ('News', 1.3341607198262488),
 ('Lifestyle', 1.5823766677008997),
 ('Productivity', 1.7375116351225566),
 ('Health & Fitness', 2.016754576481539),
 ('Music', 2.04778156996587),
 ('Sports', 2.140862550418864),
 ('Utilities', 2.513186472230841),
 ('Shopping', 2.6062674526838348),
 ('Social Networking', 3.288861309339125),
 ('Education', 3.661185231151101),
 ('Photo & Video', 4.964318957493019),
 ('Entertainment', 7.880856345020168),
 ('Games', 58.144585789636984)]

we need a list of genres for traversal

In [31]:
apple_unique_genres = list(apple_genre_freq.keys())
apple_unique_genres

['Music',
 'Navigation',
 'Weather',
 'Utilities',
 'Sports',
 'News',
 'Travel',
 'Medical',
 'Business',
 'Food & Drink',
 'Productivity',
 'Social Networking',
 'Book',
 'Lifestyle',
 'Entertainment',
 'Finance',
 'Health & Fitness',
 'Photo & Video',
 'Shopping',
 'Reference',
 'Education',
 'Catalogs',
 'Games']

### genre_ratings, len_genre and avg_ratings are lists to hold values for each unique genre where the index for each genre is same as its index in apple_unique_genres
traversing through the data set, we increment the count for genre in len_genre at its index. We also add the rating to its index in genre_ratings

In [32]:
genre_ratings = [0] * len(apple_unique_genres)
len_genre = [0] * len(apple_unique_genres)
avg_ratings = [0] * len(apple_unique_genres)

for row in apple_data[1:]:
    len_genre[apple_unique_genres.index(row[-5])] += 1
    genre_ratings[apple_unique_genres.index(row[-5])] += float(row[7])
    
avg_ratings = [a / b for a,b in zip(genre_ratings,len_genre)]
   

prints average ratig for each app

In [39]:
apple_sorted_dict = display_dict(dict(zip(apple_unique_genres,avg_ratings)))
apple_sorted_dict

[('Medical', 3.0),
 ('Sports', 3.0652173913043477),
 ('Book', 3.0714285714285716),
 ('News', 3.244186046511628),
 ('Finance', 3.375),
 ('Lifestyle', 3.411764705882353),
 ('Weather', 3.482142857142857),
 ('Travel', 3.4875),
 ('Utilities', 3.5308641975308643),
 ('Entertainment', 3.5393700787401574),
 ('Social Networking', 3.5943396226415096),
 ('Food & Drink', 3.6346153846153846),
 ('Education', 3.635593220338983),
 ('Reference', 3.6666666666666665),
 ('Health & Fitness', 3.769230769230769),
 ('Navigation', 3.8333333333333335),
 ('Photo & Video', 3.903125),
 ('Music', 3.946969696969697),
 ('Shopping', 3.9702380952380953),
 ('Business', 3.9705882352941178),
 ('Productivity', 4.0),
 ('Games', 4.037086446104589),
 ('Catalogs', 4.125)]

### Repeating same for android apps data set

In [35]:
google_freq = freq_table(google_data, 1)
google_unique_categories = list(google_freq.keys())

Profitable apps can be categorized on the basis of number of installs. since this information was missing in IOS apps data set so we used user ratings for that purpose. For android apps data set, we will use numberof installs.
No of installs are enclosed in paranthesis and has a + at the end so we must replace them before converting it to float.

In [40]:
category_installs = [0] * len(google_unique_categories)
len_category = [0] * len(google_unique_categories)
avg_installs = [0] * len(google_unique_categories)

for row in google_data[1:]:
    len_category[google_unique_categories.index(row[1])] += 1
    category_installs[google_unique_categories.index(row[1])] += float(row[5].replace("+","").replace(",",""))
    
avg_installs = [a/b for a,b in zip(category_installs, len_category)]
google_sorted_dict = display_dict(dict(zip(google_unique_categories, avg_installs)))
google_sorted_dict

[('MEDICAL', 118007.56129032258),
 ('EVENTS', 253542.22222222222),
 ('BEAUTY', 513151.88679245283),
 ('LIBRARIES_AND_DEMO', 524339.1463414634),
 ('PARENTING', 542603.6206896552),
 ('DATING', 594694.3040540541),
 ('AUTO_AND_VEHICLES', 647317.8170731707),
 ('COMICS', 817657.2727272727),
 ('FINANCE', 853634.7875),
 ('HOUSE_AND_HOME', 1288606.5857142857),
 ('LIFESTYLE', 1414198.921511628),
 ('BUSINESS', 1577920.8188585609),
 ('EDUCATION', 1671782.1782178218),
 ('FOOD_AND_DRINK', 1741556.3823529412),
 ('ART_AND_DESIGN', 1986335.0877192982),
 ('FAMILY', 2608698.339460784),
 ('TRAVEL_AND_LOCAL', 3335196.4120603013),
 ('SPORTS', 3441120.959322034),
 ('NEWS_AND_MAGAZINES', 3539576.5714285714),
 ('HEALTH_AND_FITNESS', 3877077.2803030303),
 ('MAPS_AND_NAVIGATION', 4056941.7741935486),
 ('WEATHER', 5075550.285714285),
 ('PERSONALIZATION', 5168616.05479452),
 ('SHOPPING', 5851495.2590673575),
 ('BOOKS_AND_REFERENCE', 7801001.4130434785),
 ('GAME', 9344242.86592866),
 ('TOOLS', 10576465.782841822),


In [44]:
apple_sorted_dict[-5:]

[('Shopping', 3.9702380952380953),
 ('Business', 3.9705882352941178),
 ('Productivity', 4.0),
 ('Games', 4.037086446104589),
 ('Catalogs', 4.125)]

In [45]:
google_sorted_dict[-5:]

[('PHOTOGRAPHY', 12137075.26),
 ('PRODUCTIVITY', 14651850.923529413),
 ('SOCIAL', 16945323.882882882),
 ('COMMUNICATION', 18220566.670542635),
 ('VIDEO_PLAYERS', 21221221.146496814)]

### we can see that games are popular on both platforms. We can further explore the genres by looking at the apps in these genres which is quite simple