## Profitable App Profile for the Apple App Store

In this project, I explore the applications available on the Apple App Store and their usage patterns. A dataset of about 7,000 apps was downloaded from the Apple App Store for the analysis, alongside about 10,000 apps from Google Play.

The goal of this project is to find which genres of applications are most profitable on the Apple App Store.

### The function below returns the contents of a CSV file as a list of lists

In [132]:
rootFolder = 'C:\\Users\\amit.kuma\\Desktop\\Cloud Drives\\OneDrive\\Code Base\\Github\\mydatasets'
from csv import reader
def getDataSet(fileName):
    # The with-block closes the file once all rows have been read
    with open(rootFolder+'\\'+fileName,encoding='utf8') as opened_file:
        return list(reader(opened_file))
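
To make the "list of lists" shape concrete, here is a minimal sketch that runs `csv.reader` over an in-memory string instead of a file. The sample text and the variable names are made up for illustration; only `reader` comes from the notebook.

```python
from csv import reader
from io import StringIO

# Hypothetical two-line sample standing in for a real CSV file
sample_csv = 'App,Category,Reviews\n"Notes, Lists & More",TOOLS,120\n'

rows = list(reader(StringIO(sample_csv)))
header, first_row = rows[0], rows[1]
print(header)     # the column names
print(first_row)  # the quoted comma stays inside a single field
```

Each row is a list of strings, so numeric columns still need converting with `int` or `float` before any arithmetic.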


### The function below gives a quick look at a dataset. You can print a slice of it and, optionally, the number of rows and columns

In [133]:
def explore_data(dataset,start,end,rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print('\n')
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows: ',  len(dataset))
        print('Number of columns: ', len(dataset[0]))


### For the current project, we've downloaded two CSV files, which contain

- [Google Play store data](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)
- [Apple App Store data](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

In [134]:
googleStoreData = getDataSet('googleplaystore.csv')
appleStoreData = getDataSet('AppleStore.csv')
explore_data(googleStoreData,0,0,True)
explore_data(appleStoreData,0,0,True)


Number of rows:  10842
Number of columns:  13
Number of rows:  7198
Number of columns:  16


### Finding bad entries (rows with fewer or more columns than the header row) in the dataset and deleting them

In [135]:
def deleteEntryforColumn(dataset):
    columnLen = len(dataset[0])
    # Walk backwards so that deleting a row does not shift the indices still to be checked
    for index in range(len(dataset) - 1, 0, -1):
        if len(dataset[index]) != columnLen:
            del dataset[index]
            print('Row deleted. Index No : ', index)
deleteEntryforColumn(googleStoreData)
deleteEntryforColumn(appleStoreData)


Row deleted. Index No :  10473
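
An alternative to deleting rows in place is to build a new list that keeps only the rows whose length matches the header. The sketch below is a hedged illustration, not part of the notebook's pipeline: the helper name `keep_well_formed_rows` and the sample rows are hypothetical.

```python
def keep_well_formed_rows(dataset):
    """Return the header plus every row with the same number of columns."""
    expected = len(dataset[0])
    return [dataset[0]] + [row for row in dataset[1:] if len(row) == expected]

# Made-up data: two of the four data rows have the wrong number of columns
sample = [
    ['App', 'Category', 'Rating'],
    ['Alpha', 'TOOLS', '4.1'],
    ['Broken', '1.9'],               # one column short
    ['Beta', 'GAME', '4.5'],
    ['Extra', 'GAME', '4.0', 'x'],   # one column too many
]
cleaned = keep_well_formed_rows(sample)
print(len(cleaned))
```

Because the original list is left untouched, there are no shifting indices to keep track of while filtering.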


### Finding duplicate entries in the datasets

In [136]:
def findDuplicateEntries(dataset,nameIndex):
    # Split app names into first occurrences and repeats (the header row is skipped)
    duplicateapp = []
    uniqueapp = []
    for row in dataset[1:]:
        appName = row[nameIndex]
        if appName in uniqueapp:
            duplicateapp.append(appName)
        else:
            uniqueapp.append(appName)
    return duplicateapp,uniqueapp

print(findDuplicateEntries(appleStoreData,1)[0])

    


['Mannequin Challenge', 'VR Roller Coaster']


We can see that there are `1181` duplicate apps in the Google Play data and `2` duplicate entries in the App Store data.

Inspecting these duplicates carefully, we can see that rows with the same app name differ in the number of reviews. For example, `Mannequin Challenge` appears twice in the App Store data, with `668` and `105` reviews:


> `['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', `*'668'*`, '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']`

> `['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', `*'105'*`, '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']`

So we keep the row with the highest number of reviews, on the assumption that it holds the most recent data.

### Removing duplicate entries based on this criterion to get clean datasets

In [137]:
def removeDuplicates(dataset,nameIndex,reviewIndex):
    # First pass: record the highest review count seen for each app name
    appdict = {}
    for row in dataset[1:]:
        appName = row[nameIndex]
        n_reviews = float(row[reviewIndex])
        if appName not in appdict or appdict[appName] < n_reviews:
            appdict[appName] = n_reviews
    # Second pass: keep the first row per app that matches that maximum
    app_clean = []
    already_added = set()
    for row in dataset[1:]:
        appName = row[nameIndex]
        n_reviews = float(row[reviewIndex])
        if appName not in already_added and n_reviews == appdict[appName]:
            app_clean.append(row)
            already_added.add(appName)
    return app_clean

googleCleanData = removeDuplicates(googleStoreData,0,3)
# Column 5 of the App Store data holds the rating count (column 4 is the price)
appleCleanData = removeDuplicates(appleStoreData,1,5)
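
The keep-the-most-reviewed rule can also be sketched in a single pass, using a dictionary that maps each app name to the best row seen so far. This is an illustrative variant under stated assumptions, not the notebook's method: `keep_most_reviewed` and the sample rows are hypothetical, and the input is assumed to have no header row.

```python
def keep_most_reviewed(rows, name_index, review_index):
    """Keep one row per app name: the one with the most reviews (first wins ties)."""
    best = {}
    for row in rows:
        name = row[name_index]
        if name not in best or float(row[review_index]) > float(best[name][review_index]):
            best[name] = row
    # Dicts preserve insertion order, so apps come out in first-seen order
    return list(best.values())

# Made-up rows: [name, number of reviews]
sample = [
    ['Chat', '100'],
    ['Maps', '50'],
    ['Chat', '250'],
    ['Chat', '250'],
]
deduplicated = keep_most_reviewed(sample, 0, 1)
print(deduplicated)
```

One difference to be aware of: the result is ordered by the first appearance of each name, not by the position of the row that was kept.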


### Preparing and filtering the datasets to remove non-English apps

In [140]:
def isEnglishString(inputStr):
    # Allow a couple of non-ASCII characters (symbols, emoji);
    # a third one marks the name as non-English
    non_ascii = 0
    for character in inputStr:
        if ord(character) > 127:
            non_ascii += 1
            if non_ascii >= 3:
                return False
    return True

def getOnlyEngApp(dataset,nameIndex):
    engappdata = []
    for row in dataset:
        appName = row[nameIndex]
        if isEnglishString(appName):
            engappdata.append(row)
    return engappdata

googleEngData = getOnlyEngApp(googleCleanData,0)
appleEngData = getOnlyEngApp(appleCleanData,1)


### Since the analysis is limited to free apps, we filter the datasets further to remove paid apps

In [141]:
def removePaidApps(dataset,amountIndex,typeFlag=False):
    # typeFlag=True: the column holds a label such as 'Free'; otherwise it holds a numeric price
    freeAppData = []
    for row in dataset:
        if typeFlag and row[amountIndex] == 'Free':
            freeAppData.append(row)
        elif not typeFlag and float(row[amountIndex]) == 0.0:
            freeAppData.append(row)
    return freeAppData

googleFreeApp = removePaidApps(googleEngData,6,True)
appleFreeApp = removePaidApps(appleEngData,4)
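
Where a price is stored as text rather than a number, a small parser can make the free-versus-paid test uniform. The sketch below assumes prices look like `'0'`, `'0.0'`, or `'$4.99'`; that format and the helper names `parse_price` and `is_free` are assumptions for illustration, not something the notebook defines.

```python
def parse_price(text):
    """Convert a price string such as '$4.99' or '0' to a float."""
    return float(text.strip().lstrip('$'))

def is_free(text):
    """True when the parsed price is exactly zero."""
    return parse_price(text) == 0.0

for value in ['0', '0.0', '$4.99', ' $0.99 ']:
    print(repr(value), is_free(value))
```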



- Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.

- To minimize risks and overhead, our validation strategy for an app idea comprises three steps:
    - Build a minimal Android version of the app, and add it to Google Play.
    - If the app has a good response from users, we develop it further.
    - If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

- Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

In [172]:
def frequencyTable(dataset,index):
    curDict = {}
    for row in dataset:
        curkey = row[index]
        if curkey in curDict:
            curDict[curkey] +=1
        else:
            curDict[curkey] = 1
    return curDict


def display_table(freqTable, index=0):
    table = freqTable
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

#print(googleStoreData[0]) #1, 9
#print(appleStoreData[0]) #11,5
#print(display_table(googleFreeApp,1))
print('\n')
#print(display_table(googleFreeApp,9))
print('\n')
#print(display_table(appleFreeApp,11))

prime_genre_dict = frequencyTable(appleFreeApp,11)
prime_genre_rating_dict = {}
for genre in prime_genre_dict:
    total_rating = 0
    len_genre = 0
    for row in appleFreeApp:
        if genre == row[11]:
            total_rating += int(row[5])
            len_genre += 1

    prime_genre_rating_dict[genre] = int(total_rating/len_genre)

#display_table(prime_genre_rating_dict)

category_dict = frequencyTable(googleFreeApp,1)
category_dict_install = {}
for category in category_dict:
    total_install = 0
    len_category = 0
    for row in googleFreeApp:
        if category == row[1]:
            # Strip '+' and ',' from the install count before converting it
            total_install += float(row[5].replace('+','').replace(',',''))
            len_category += 1
    # Average number of installs per app in this category
    category_dict_install[category] = int(total_install/len_category)
display_table(category_dict_install)









COMMUNICATION : 38590581
VIDEO_PLAYERS : 24727872
SOCIAL : 23253652
PHOTOGRAPHY : 17840110
PRODUCTIVITY : 16787331
GAME : 15606004
TRAVEL_AND_LOCAL : 13984077
ENTERTAINMENT : 11640705
TOOLS : 10830251
NEWS_AND_MAGAZINES : 9549178
BOOKS_AND_REFERENCE : 8814199
SHOPPING : 7036877
PERSONALIZATION : 5201482
WEATHER : 5145550
HEALTH_AND_FITNESS : 4188821
MAPS_AND_NAVIGATION : 4049274
FAMILY : 3697848
SPORTS : 3650602
ART_AND_DESIGN : 1986335
FOOD_AND_DRINK : 1924897
EDUCATION : 1833495
BUSINESS : 1712290
LIFESTYLE : 1446158
FINANCE : 1387692
HOUSE_AND_HOME : 1360598
DATING : 854028
COMICS : 832613
AUTO_AND_VEHICLES : 647317
LIBRARIES_AND_DEMO : 638503
PARENTING : 542603
BEAUTY : 513151
EVENTS : 253542
MEDICAL : 120550
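
The per-category average above rescans the whole dataset once for every category. A single-pass version can be sketched by keeping a running total and count per category. The helper `average_by_key` and the sample rows are hypothetical, and install counts are assumed to be strings such as `'1,000+'`.

```python
def average_by_key(rows, key_index, value_index):
    """Average a numeric-like text column per key in one pass over the rows."""
    totals = {}
    counts = {}
    for row in rows:
        key = row[key_index]
        # Assumed format: digits with optional ',' separators and a trailing '+'
        value = float(row[value_index].replace('+', '').replace(',', ''))
        totals[key] = totals.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}

# Made-up rows: [name, category, installs]
sample = [
    ['Alpha', 'TOOLS', '1,000+'],
    ['Beta', 'TOOLS', '3,000+'],
    ['Gamma', 'GAME', '10,000+'],
]
averages = average_by_key(sample, 1, 2)
for category, mean in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(category, ':', int(mean))
```

Sorting on the average with an explicit `key` also avoids building `(value, key)` tuples just to get the ordering.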
