# Profitable App Profiles on Apple and Android Store

*The goal of this project is to analyze app data to help developers understand which app profiles are likely to attract more users. Most of the apps are free to download, so the main source of revenue is in-app ads. The more users engage with an app and see its ads, the more revenue it generates.*

## Exploring the data

*As of September 2018, there were about 2 million iOS apps available on the Apple App Store and about 2.1 million Android apps on the Google Play Store. The datasets used here are samples of that data: the Google Play Store dataset contains about 10,800 apps and the Apple Store dataset about 7,200 apps.*

In [4]:
# Open Apple Store data
from csv import reader
opened_file = open('AppleStoreData.csv', encoding='UTF-8')  # Specify the encoding of the input file explicitly
read_file = reader(opened_file)
ios_data = list(read_file)
opened_file.close()  # Release the file handle once the rows are in memory
ios_header = ios_data[0]  # Header row of the iOS data
ios = ios_data[1:]  # All iOS rows except the header

In [5]:
# Open Google Play Store data
from csv import reader
opened_file = open('googleplaystore.csv', encoding='UTF-8')  # Specify the encoding of the input file explicitly
read_file = reader(opened_file)
playstore_data = list(read_file)
opened_file.close()  # Release the file handle once the rows are in memory
android_header = playstore_data[0]  # Header row of the Android data
android = playstore_data[1:]  # All Android rows except the header
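
Both loading cells repeat the same open, read and split steps. A minimal sketch of a reusable loader follows; the helper name `load_csv` is hypothetical and not part of the original notebook.

```python
from csv import reader


def load_csv(path, encoding='UTF-8'):
    """Return (header, rows) for a CSV file. Hypothetical helper, not in the original notebook."""
    # The with-block closes the file even if reading fails
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]
```

Under that assumption, `ios_header, ios = load_csv('AppleStoreData.csv')` would produce the same two variables as the cells above.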

In [6]:
# Helper that prints a slice of rows and, optionally, the dimensions of the dataset
def explore_data(dataset,start,end,rows_columns = False):
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # Adds a new empty line after each row
    if rows_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:',len(dataset[0]))
        

In [7]:
# Explore ios data
print(ios_header)
print('\n')
explore_data(ios,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


*The Apple Store data has 7197 records with 16 variables. Some of the variables that will be most useful for the analysis are 'track_name', 'price', 'rating_count_tot', 'user_rating' and 'prime_genre'.*

In [8]:
# Explore playstore data
print(android_header)
print('\n')
explore_data(android,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


*The Google Play Store data has 10841 records with 13 variables. Some of the variables that will be most useful for the analysis are 'App', 'Genres', 'Rating', 'Reviews', 'Type', 'Installs' and 'Price'.*

### ios data dictionary

## Data Cleaning

### Playstore data cleaning
Based on discussions in the Play Store dataset community, the following cleaning steps are needed:

* Remove the malformed entry at row 10472 (index counted without the header): its Category value is missing, so the remaining columns are shifted.
* Remove non-English apps.
* Remove apps that are not free.

In [9]:
print(android_header)
print('\n')
explore_data(android,10472,10473)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




As can be seen, row 10472 has no value for Category, so every subsequent column is shifted one position to the left: the Category column holds 1.9, which is really the rating. This row needs to be deleted to clean up the Play Store data.
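
A column shift of this kind leaves the row shorter than the header, so such rows can also be found programmatically. The following is a small sketch on made-up sample rows; `find_malformed_rows` is a hypothetical helper, not part of the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return the indexes of rows whose length differs from the header's (hypothetical helper)."""
    return [i for i, row in enumerate(rows) if len(row) != len(header)]


# Made-up sample: the second row is missing its Category value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
]
print(find_malformed_rows(sample_header, sample_rows))  # prints [1]
```

Applied to the real data as `find_malformed_rows(android_header, android)`, this should report index 10472, since that row has 12 values against 13 header columns.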

In [10]:
# Delete row 10472 (run this cell only once, otherwise a valid row is deleted as well)
del android[10472]

In [11]:
explore_data(android,0,2,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10840
Number of columns: 13


After deleting that row, the Android dataset has 10840 rows.

## Removing Duplicate Entries
Check whether the Android data contains duplicate entries for the same app and decide on the best way to remove them.

In [12]:
#Instagram has 4 entries in android
for app in android:
    name = app[0]
    if name == "Instagram":
        print(app)


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


As can be seen, the only difference between the rows is the fourth column (index 3), which holds the number of reviews. Rather than removing duplicates at random, it is better to keep the row with the highest number of reviews, since that is most likely the most recent record, and to remove the other rows.
In total, there are 1,181 cases where an app occurs more than once.

In [13]:
#Counting the duplicate apps
unique_apps=[]
duplicate_apps=[]

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of Duplicate apps:', len(duplicate_apps))
print('\n')    
print('Example of Duplicate apps:', duplicate_apps[:5])



Number of Duplicate apps: 1181


Example of Duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
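
The list-based membership test above rescans `unique_apps` for every row. As an alternative sketch, `collections.Counter` gives the same count in a single pass; `count_duplicates` is a hypothetical helper and the sample rows are made up.

```python
from collections import Counter


def count_duplicates(rows, name_index):
    """Return (number of extra occurrences, names seen more than once). Hypothetical helper."""
    counts = Counter(row[name_index] for row in rows)
    repeated = [name for name, n in counts.items() if n > 1]
    # Every occurrence beyond the first counts as a duplicate, as in the loop above
    extra = sum(n - 1 for n in counts.values())
    return extra, repeated


sample = [['Instagram'], ['Box'], ['Instagram'], ['Instagram']]
print(count_duplicates(sample, 0))  # prints (2, ['Instagram'])
```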


In [14]:
#Number of rows remaining in the data which are unique
print("Number of Unique apps in Android data:", len(android)-len(duplicate_apps))

Number of Unique apps in Android data: 9659


### Creating dictionary to remove duplicates
To remove the duplicates, we will:

* Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
* Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [15]:
# Create a dictionary mapping each unique app name to its highest number of reviews
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:  # App already seen: keep the larger review count
        reviews_max[name] = n_reviews
    elif name not in reviews_max:  # First time the app is seen: store its review count
        reviews_max[name] = n_reviews


len(reviews_max)


9659

To build the cleaned dataset:

* Create two empty lists: android_clean (which will store the cleaned dataset) and already_added (which will store only app names).
* Loop through the Google Play dataset (without the header row), and for each iteration:
    * Assign the app name to a variable named name.
    * Convert the number of reviews to float and assign it to a variable named n_reviews.
    * If n_reviews equals the maximum number of reviews for that app (stored in the reviews_max dictionary) and name is not already in already_added:
        * Append the entire row to android_clean (a list of lists that will hold the cleaned dataset).
        * Append the app name to already_added, which keeps track of the apps already added.

In [16]:
# Build the cleaned dataset, keeping one row per app (the one with the most reviews)
android_clean=[]
already_added=[]
for app in android:
    name = app[0]
    n_reviews=float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

In [17]:
explore_data(android_clean,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


In [18]:
##Checking sample data for Android where app = Instagram
for app in android_clean:
    name = app[0]
    if name == "Instagram":
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


As can be seen, Instagram now has a single record, the one with the maximum of 66577446 reviews, out of the 4 records that were present earlier.

In [19]:
#Checking duplicate apps in ios based on App Name
unique_apps=[]
duplicate_apps=[]

for app in ios:
    name = app[1]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of Duplicate apps:', len(duplicate_apps))
print('\n')    
print('Example of Duplicate apps:', duplicate_apps[:5])

Number of Duplicate apps: 2


Example of Duplicate apps: ['Mannequin Challenge', 'VR Roller Coaster']


In [20]:
# Inspect the iOS rows whose app names appear more than once
for app in ios:
    name = app[1]
    if name == "Mannequin Challenge" or name == "VR Roller Coaster":
        print(app)

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


However, the values in these rows are quite different, so they are distinct apps that share a name rather than duplicates. Let's check for duplicates based on the app id instead.

In [21]:
# Check for duplicate apps in iOS based on app id
unique_apps=[]
duplicate_apps=[]

for app in ios:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of Duplicate apps:', len(duplicate_apps))
print('\n')    
print('Example of Duplicate apps:', duplicate_apps[:5])

Number of Duplicate apps: 0


Example of Duplicate apps: []


Conclusion: There are no duplicate apps in ios data.

## Removing Non English Apps
Only English apps are considered in this analysis, so apps in any other language need to be removed.

In [22]:
# Some examples of non English apps based on the discussion in the community
print(ios[813][1])
print(ios[6731][1])
print('\n')
print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


In [23]:
# There is a corresponding number associated with each character in Python
print(ord('a'))
print(ord('A'))
print(ord('5'))
print(ord('+'))
print(ord('中'))
print(ord('ス'))

97
65
53
43
20013
12473


The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.

In [24]:
# Writing a function which checks whether character within the strings are english characters or not
# Strings are indexable and iterable
def check_english(input_string):
    for i in input_string:
        if ord(i) >127:
            return False
        
    return True

In [25]:
print(check_english('Instagram'))
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))

True
False
False
False


Given the intent of the function, the last two apps should have returned True, since they are English apps. However, emojis and characters such as ™ fall outside the ASCII range, so many English apps would be wrongly labeled as non-English and removed. Checking the code points of these characters:

In [26]:
print(ord('😜'))
print(ord('™'))

128540
8482


To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

In [27]:
# Rewrite the function to tolerate up to three non-ASCII characters
def check_english(input_string):
    asc_ii=0  # Counts the characters that fall outside the ASCII range
    for i in input_string:
        if ord(i) >127:
            asc_ii+=1
    
    if asc_ii > 3:
        return False
    else:
        return True

In [28]:
print(check_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_english('Docs To Go™ Free Office Suite'))
print(check_english('Instachat 😜'))

False
True
True


This function is still not perfect: it removes most non-English apps, but a name with three or fewer non-ASCII characters still passes the check, and there is a small chance that some English apps are removed too.
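
To make that limitation concrete, here is a sketch of the same check with the threshold exposed as a parameter. The name `is_probably_english` and the `max_non_ascii` parameter are hypothetical and not part of the original notebook.

```python
def is_probably_english(name, max_non_ascii=3):
    """Hypothetical variant of check_english with a configurable threshold."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))  # True: one non-ASCII character
print(is_probably_english('优酷视频'))  # False: four non-ASCII characters
print(is_probably_english('优酷'))  # True: a short non-English name slips through
```

A stricter threshold catches more non-English names at the cost of dropping more English names that contain emoji or symbols, so the value 3 is a trade-off rather than a guarantee.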

In [29]:
#Using this function to remove all non english apps in ios data
ios_english=[]
ios_non_english=[]
for app in ios:
    name = app[1]# App name is indexed at 1
    if check_english(name)==True:
        ios_english.append(app)
    else:
        ios_non_english.append(app)   

In [30]:
#Exploring both ios english and ios non english list
explore_data(ios_english,0,3,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns: 16


In [31]:
explore_data(ios_non_english,0,3,True)

['445375097', '爱奇艺PPS -《欢乐颂2》电视剧热播', '224617472', 'USD', '0.0', '14844', '0', '4.0', '0.0', '6.3.3', '17+', 'Entertainment', '38', '5', '3', '1']


['405667771', '聚力视频HD-人民的名义,跨界歌王全网热播', '90725376', 'USD', '0.0', '7446', '8', '4.0', '4.5', '5.0.8', '12+', 'Entertainment', '24', '4', '1', '1']


['336141475', '优酷视频', '204959744', 'USD', '0.0', '4885', '0', '3.5', '0.0', '6.7.0', '12+', 'Entertainment', '38', '0', '2', '1']


Number of rows: 1014
Number of columns: 16


In [32]:
#Using this function to remove all non english apps in android clean data
android_english=[]
android_non_english=[]
for app in android_clean:
    name = app[0] #App Name is indexed at 0
    if check_english(name)==True:
        android_english.append(app)
    else:
        android_non_english.append(app)  

In [33]:
#Exploring both android english and android non english lists
explore_data(android_english,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


In [34]:
explore_data(android_non_english,0,3,True)

['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'July 26, 2018', '3.3', '4.1 and up']


['သိင်္ Astrology - Min Thein Kha BayDin', 'LIFESTYLE', '4.7', '2225', '15M', '100,000+', 'Free', '0', 'Everyone', 'Lifestyle', 'July 26, 2018', '4.2.1', '4.0.3 and up']


['РИА Новости', 'NEWS_AND_MAGAZINES', '4.5', '44274', '8.0M', '1,000,000+', 'Free', '0', 'Everyone', 'News & Magazines', 'August 6, 2018', '4.0.6', '4.4 and up']


Number of rows: 45
Number of columns: 13


After removing inaccurate data, duplicates and non-English apps, the datasets have the following number of rows:
* ios_english: 6183 rows
* android_english: 9614 rows

### Removing Paid apps
The last step is to remove the paid apps, as we are only interested in apps that are free to install. In the iOS data, the price is at index 4; a price of 0.0 means the app is free. In the Android data, the Type column (index 6) indicates whether an app is free or paid; Type == "Free" means the app is free to install.
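
The iOS filter in the next cell compares the price with the string '0.0', which would miss a free app whose price is written as '0' or '0.00'. A slightly more defensive sketch converts the price to a number first; `parse_price` and `is_free` are hypothetical helpers, not part of the original notebook.

```python
def parse_price(raw):
    """Convert a price string such as '0.0', '0' or '$4.99' to a float (hypothetical helper)."""
    return float(raw.strip().lstrip('$'))


def is_free(raw_price):
    """Return True when the parsed price is zero (hypothetical helper)."""
    return parse_price(raw_price) == 0.0


print(is_free('0.0'), is_free('0'), is_free('$4.99'))  # prints True True False
```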


In [35]:
# Separate free and paid apps in the iOS data
ios_free=[]
ios_paid=[]
for app in ios_english:
    price = app[4]  # Price is at index 4
    if price == '0.0':
        ios_free.append(app)
    else:
        ios_paid.append(app)  

In [36]:
#Exploring both ios free and paid data list
explore_data(ios_free,0,3,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 3222
Number of columns: 16


In [37]:
explore_data(ios_paid,0,3,True)

['362949845', 'Fruit Ninja Classic', '104590336', 'USD', '1.99', '698516', '132', '4.5', '4.0', '2.3.9', '4+', 'Games', '38', '5', '13', '1']


['500116670', 'Clear Vision (17+)', '37879808', 'USD', '0.99', '541693', '69225', '4.5', '4.5', '1.1.3', '17+', 'Games', '43', '5', '1', '1']


['479516143', 'Minecraft: Pocket Edition', '147787776', 'USD', '6.99', '522012', '1148', '4.5', '4.5', '1.1', '9+', 'Games', '37', '1', '11', '1']


Number of rows: 2961
Number of columns: 16


In [38]:
# Separate free and paid apps in the Android data
android_free=[]
android_paid=[]
for app in android_english:
    price = app[6]  # Type ('Free' or 'Paid') is at index 6
    if price == 'Free':
        android_free.append(app)
    else:
        android_paid.append(app)  

In [39]:
#Exploring both android free and paid lists
explore_data(android_free,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8863
Number of columns: 13


In [40]:
explore_data(android_paid,0,3,True)

['TurboScan: scan documents and receipts in PDF', 'BUSINESS', '4.7', '11442', '6.8M', '100,000+', 'Paid', '$4.99', 'Everyone', 'Business', 'March 25, 2018', '1.5.2', '4.0 and up']


['Tiny Scanner Pro: PDF Doc Scan', 'BUSINESS', '4.8', '10295', '39M', '100,000+', 'Paid', '$4.99', 'Everyone', 'Business', 'April 11, 2017', '3.4.6', '3.0 and up']


['Puffin Browser Pro', 'COMMUNICATION', '4.0', '18247', 'Varies with device', '100,000+', 'Paid', '$3.99', 'Everyone', 'Communication', 'July 5, 2018', '7.5.3.20547', '4.1 and up']


Number of rows: 751
Number of columns: 13


After removing paid apps, data cleaning is complete and the datasets have the following number of rows:
* ios_free: 3222 rows
* android_free: 8863 rows

## Most Common Genres for both ios and Android
The goal is to determine the kinds of apps that are likely to attract more users, because revenue is highly influenced by the number of people using our apps.

Validation Strategy:
* Build a minimal Android version of the app, and add it to Google Play.
* If the app has a good response from users, we develop it further.
* If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets. For example, a profile that might work well in both markets is a productivity app.

Generate frequency tables for the most common genres in each market:
* ios_free: prime_genre (Index = 11)
* android_free: Category (Index = 1), Genres (Index = 9)

The Genres and Category columns in the Android data appear to overlap. Let's check that.

We'll build two functions we can use to analyze the frequency tables:

* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in a descending order


Dictionaries are not sorted by value, which makes the frequency tables hard to analyze. A second function is therefore needed to display the entries of a frequency table in descending order. Passing a dictionary directly to sorted() only sorts its keys, but if the dictionary is converted into a list of tuples, where the dictionary value comes first and the dictionary key comes second, sorted() orders the entries by value:
* freq_table = {'Genre_1': 50, 'Genre_2': 70, 'Genre_3': 100}
* freq_table_tuple = [(50, 'Genre_1'), (70, 'Genre_2'), (100, 'Genre_3')]
* sorted(freq_table_tuple, reverse=True)
* Result = [(100, 'Genre_3'), (70, 'Genre_2'), (50, 'Genre_1')]
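
The conversion described above can be written as runnable Python. This is a small sketch on the same made-up genre counts; the variable names are illustrative only.

```python
freq_table_example = {'Genre_1': 50, 'Genre_2': 70, 'Genre_3': 100}

# Put the value first so that sorted() orders the tuples by frequency
as_tuples = [(count, genre) for genre, count in freq_table_example.items()]
result = sorted(as_tuples, reverse=True)
print(result)  # prints [(100, 'Genre_3'), (70, 'Genre_2'), (50, 'Genre_1')]

# Alternative: keep (key, value) pairs and sort on the value with a key function
by_value = sorted(freq_table_example.items(), key=lambda kv: kv[1], reverse=True)
print(by_value)  # prints [('Genre_3', 100), ('Genre_2', 70), ('Genre_1', 50)]
```

The key-function form avoids building the intermediate list, while the tuple form also breaks ties by genre name.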

The second function, display_table(), has the following requirements:
* Takes in two parameters: dataset and index. dataset is expected to be a list of lists, and index is expected to be an integer.
* Generates a frequency table using the freq_table() function defined below.
* Transforms the frequency table into a list of tuples, then sorts the list in descending order.
* Prints the entries of the frequency table in descending order.

In [72]:
## Creating first function which generates the frequency table for Genres
def freq_table(dataset,index):
    dict_count={}
    table_percentages={}
    total=0
    
    for row in dataset:
        total+=1
        name = row[index]
        if name in dict_count:
            dict_count[name]+=1
        else:
            dict_count[name]=1

    for key in dict_count:
        percentage = round((dict_count[key]/total)*100,2)
        table_percentages[key]=percentage
        
    return table_percentages

In [73]:
# prime_genre in iOS data
freq_table(ios_free,11)

{'Social Networking': 3.29,
 'Photo & Video': 4.97,
 'Games': 58.16,
 'Music': 2.05,
 'Reference': 0.56,
 'Health & Fitness': 2.02,
 'Weather': 0.87,
 'Utilities': 2.51,
 'Travel': 1.24,
 'Shopping': 2.61,
 'News': 1.33,
 'Navigation': 0.19,
 'Lifestyle': 1.58,
 'Entertainment': 7.88,
 'Food & Drink': 0.81,
 'Sports': 2.14,
 'Book': 0.43,
 'Finance': 1.12,
 'Education': 3.66,
 'Productivity': 1.74,
 'Business': 0.53,
 'Catalogs': 0.12,
 'Medical': 0.19}

In [74]:
# Genres in Android data
freq_table(android_free,9)

{'Art & Design': 0.6,
 'Art & Design;Creativity': 0.07,
 'Auto & Vehicles': 0.93,
 'Beauty': 0.6,
 'Books & Reference': 2.14,
 'Business': 4.59,
 'Comics': 0.61,
 'Comics;Creativity': 0.01,
 'Communication': 3.24,
 'Dating': 1.86,
 'Education': 5.35,
 'Education;Creativity': 0.05,
 'Education;Education': 0.34,
 'Education;Pretend Play': 0.06,
 'Education;Brain Games': 0.03,
 'Entertainment': 6.07,
 'Entertainment;Brain Games': 0.08,
 'Entertainment;Creativity': 0.03,
 'Entertainment;Music & Video': 0.17,
 'Events': 0.71,
 'Finance': 3.7,
 'Food & Drink': 1.24,
 'Health & Fitness': 3.08,
 'House & Home': 0.82,
 'Libraries & Demo': 0.94,
 'Lifestyle': 3.89,
 'Lifestyle;Pretend Play': 0.01,
 'Card': 0.45,
 'Arcade': 1.85,
 'Puzzle': 1.13,
 'Racing': 0.99,
 'Sports': 3.46,
 'Casual': 1.76,
 'Simulation': 2.04,
 'Adventure': 0.68,
 'Trivia': 0.42,
 'Action': 3.1,
 'Word': 0.26,
 'Role Playing': 0.94,
 'Strategy': 0.9,
 'Board': 0.38,
 'Music': 0.2,
 'Action;Action & Adventure': 0.1,
 'Casua

In [76]:
# Converting frequency table into Tuple and Sorting the data
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [77]:
display_table(ios_free,11)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12


When analyzing the free English apps on the iOS store, these are the most interesting insights:
* Games is the most common genre with a 58% share, i.e. more than half of the apps.
* The next most common genre is Entertainment, with a 7.88% share.
* Among the most common app profiles on the iOS store are Games, Entertainment, Photo & Video and Social Networking, all of which are apps designed for fun, while apps with practical purposes such as education or productivity are much rarer. However, a large number of apps in a genre does not imply a large number of users: the supply of apps may not match the demand for them.

In [78]:
#Understanding the Category in Android
display_table(android_free,1)

FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6


When analyzing the free English apps on the Android store, these are the most interesting insights:
* Family is the most common category, with a share of about 19%.
* The next most common category is Game, with a 9.73% share.
* The most common app profiles on the Android store are Family, Game, Tools and Business, which suggests that the Android store has a good mix of apps for fun and for practical purposes. This cannot be confirmed without knowing which apps fall under the Family category, since it may well contain a lot of games.

In [80]:
#Understanding the Genre in Android free apps
display_table(android_free,9)

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.9
House & Home : 0.82
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.6
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Board : 0.38
Educational : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual;B

The difference between the Genres and the Category columns is not crystal clear, but one thing we can notice is that the Genres column is much more granular (it has more categories). We're only looking for the bigger picture at the moment, so we'll only work with the Category column moving forward.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps. Now we'd like to get an idea about the kind of apps that have most users.

## Most Popular Apps by Genre on iOS and Android Store 

We can estimate the most popular apps by genre on both stores from the number of installs. The iOS data has no information on the number of installs, so we use the total number of user ratings (rating_count_tot) as a proxy, while the Android data has an Installs column.
* iOS - rating_count_tot (Index = 5)
* Android - Installs (Index = 5)
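
The Installs values are stored as strings such as '10,000+', so they need to be converted before any arithmetic is done on them. A minimal sketch follows, assuming that every value uses this comma-and-plus format; `parse_installs` is a hypothetical helper, not part of the original notebook.

```python
def parse_installs(raw):
    """Turn an Installs string such as '10,000+' into a float (hypothetical helper)."""
    return float(raw.replace(',', '').replace('+', ''))


for value in ['10,000+', '500,000+', '1,000,000,000+']:
    print(value, '->', parse_installs(value))
```

Note that these are open-ended buckets, so the parsed number is a lower bound on installs rather than an exact count.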

Let's start with calculating the average number of user ratings per app genre on the App Store. To do that, we'll need to:

* Isolate the apps of each genre.
* Sum up the number of user ratings (rating_count_tot, index 5) for the apps of that genre.
* Divide the sum by the number of apps belonging to that genre (not by the total number of apps).

E.g. if the rating counts of all apps in Games sum to 100 and there are 200 apps in Games, the average number of user ratings would be 100/200 = 0.5.

In [83]:
## Redefine freq_table to return raw counts per genre (this overwrites the percentage version above, which display_table relies on)
def freq_table(dataset,index):
    dict_count={}
    
    for row in dataset:
        name=row[index]
        if name in dict_count:
            dict_count[name]+=1
        else:
            dict_count[name]=1
    return dict_count
        
        

In [84]:
ios_genres=freq_table(ios_free,11)
ios_genres

{'Social Networking': 106,
 'Photo & Video': 160,
 'Games': 1874,
 'Music': 66,
 'Reference': 18,
 'Health & Fitness': 65,
 'Weather': 28,
 'Utilities': 81,
 'Travel': 40,
 'Shopping': 84,
 'News': 43,
 'Navigation': 6,
 'Lifestyle': 51,
 'Entertainment': 254,
 'Food & Drink': 26,
 'Sports': 69,
 'Book': 14,
 'Finance': 36,
 'Education': 118,
 'Productivity': 56,
 'Business': 17,
 'Catalogs': 4,
 'Medical': 6}

For example, if we want to count the average rating for each genre:

    App      Genre     Rating_Count_tot    
    Game1    Game      100
    Game2    Game      50
    Game3    Game      150
    
In this case, the total would be 300 and the number of apps in the Game genre would be 3, so the average number of ratings would be 300/3 = 100.
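The toy table above can be checked with a minimal sketch. The `toy_apps` list and its three-column layout are made up for illustration and are not part of the real dataset:

```python
# Hypothetical toy data: [app name, genre, rating_count_tot]
toy_apps = [
    ['Game1', 'Game', '100'],
    ['Game2', 'Game', '50'],
    ['Game3', 'Game', '150'],
]

total = 0      # Sum of rating counts for the genre
len_genre = 0  # Number of apps in the genre
for row in toy_apps:
    if row[1] == 'Game':
        total += float(row[2])
        len_genre += 1

avg_n_ratings = total / len_genre
print(total, len_genre, avg_n_ratings)
```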

In [88]:
# Loop over the iOS genres
for genre in ios_genres:
    total = 0 # Sum of rating counts (rating_count_tot), not of the rating values themselves
    len_genre = 0 # Number of apps in this genre
    for row in ios_free:
        genre_app=row[11]
        if genre_app == genre:
            total+= float(row[5]) # Add this app's number of ratings
            len_genre+=1
    
    avg_ratings= round(total/len_genre,2)
    print(genre,":",avg_ratings) 
            

Social Networking : 71548.35
Photo & Video : 28441.54
Games : 22788.67
Music : 57326.53
Reference : 74942.11
Health & Fitness : 23298.02
Weather : 52279.89
Utilities : 18684.46
Travel : 28243.8
Shopping : 26919.69
News : 21248.02
Navigation : 86090.33
Lifestyle : 16485.76
Entertainment : 14029.83
Food & Drink : 33333.92
Sports : 23008.9
Book : 39758.5
Finance : 31467.94
Education : 7003.98
Productivity : 21028.41
Business : 7491.12
Catalogs : 4004.0
Medical : 612.0


On average, navigation apps have the highest number of user reviews, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user reviews together:

In [93]:
for row in ios_free:
    if row[11] == "Navigation":
        print(row[1],':',row[5])
    

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


In [94]:
for row in ios_free:
    if row[11] == "Reference":
        print(row[1],':',row[5])
    

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The Reference genre on iOS includes a few apps with a very large number of ratings, notably Bible and Dictionary.com, followed by Google Translate.

*Facebook and Pinterest heavily influence the Social Networking genre, so its numbers are skewed by the heavy usage of a few apps. Reference apps have around 74,942 user ratings on average, but this figure is heavily tilted by the Bible and Dictionary.com apps. However, this niche seems to show some potential. One thing we could do is take another popular book and turn it into an app where we could add different features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so users don't need to exit our app to look up words in an external app.*

*This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests the market might be a bit saturated with for-fun apps, which means a practical app might have more of a chance to stand out among the huge number of apps on the App Store.*

*Other genres that seem popular include weather, book, food and drink, or finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:*

* *Weather apps: people generally don't spend much time in-app, so the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.*
* *Food and drink: examples here include Starbucks, Dunkin' Donuts, McDonald's, etc. Making a popular food and drink app requires actual cooking and a delivery service, which is outside the scope of our company.*
* *Finance apps: these apps involve banking, paying bills, money transfer, etc. Building a finance app requires domain knowledge, and we don't want to hire a finance expert just to build an app.*
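
Because a few apps dominate these averages, a median can give a more robust picture of a typical app in a genre. The sketch below is an addition, not part of the original analysis. It uses the standard-library `statistics` module on the Navigation rating counts printed earlier:

```python
from statistics import mean, median

# Rating counts for the free Navigation apps, copied from the output above
navigation_counts = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = round(mean(navigation_counts), 2)
nav_median = median(navigation_counts)

print('Mean:', nav_mean)      # Pulled up by Waze and Google Maps
print('Median:', nav_median)  # Middle of the sorted counts
```

The mean reproduces the 86090.33 reported for Navigation, while the median is far lower, which illustrates how much two apps drive the genre average.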

### Android Store Analysis - Most Popular apps by genre on Google Play

In [95]:
## Check the distribution of install counts
display_table(android_free,5) 
## The install numbers are not precise: they are open-ended buckets such as
## 100+ or 10,000+. We will take them at face value, so 100+ is treated as
## 100 and 10,000+ as 10000. To do that, each string has to be converted
## to a float after removing the commas and the + sign.

1,000,000+ : 1394
100,000+ : 1024
10,000,000+ : 935
10,000+ : 904
1,000+ : 744
100+ : 613
5,000,000+ : 605
500,000+ : 493
50,000+ : 423
5,000+ : 400
10+ : 314
500+ : 288
50,000,000+ : 204
100,000,000+ : 189
50+ : 170
5+ : 70
1+ : 45
500,000,000+ : 24
1,000,000,000+ : 20
0+ : 4


In [96]:
## Example ---- To remove characters from strings
n_installs = '100,000+'
print(n_installs.replace('+','plus'))
print(n_installs.replace('1','one'))
print(n_installs.replace('&','ampersand')) ## No change

## To remove certain characters, we can replace them with the empty string
## '':
print(n_installs.replace('+',''))
n_installs=n_installs.replace('+','')
print(n_installs)
n_installs=n_installs.replace(',','')
print(n_installs)

100,000plus
one00,000+
100,000+
100,000
100,000
100000
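
The two `replace` calls above can be wrapped in a small helper so the conversion is written once. The function name `parse_installs` is hypothetical and does not appear in the original notebook:

```python
def parse_installs(text):
    """Convert an install bucket label such as '10,000+' to a float.

    The result is the lower bound of the bucket, not an exact install count.
    """
    return float(text.replace('+', '').replace(',', ''))


for label in ['0+', '100+', '10,000+', '1,000,000,000+']:
    print(label, '->', parse_installs(label))
```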


In [98]:
category_android = freq_table(android_free,1)
category_android

{'ART_AND_DESIGN': 57,
 'AUTO_AND_VEHICLES': 82,
 'BEAUTY': 53,
 'BOOKS_AND_REFERENCE': 190,
 'BUSINESS': 407,
 'COMICS': 55,
 'COMMUNICATION': 287,
 'DATING': 165,
 'EDUCATION': 103,
 'ENTERTAINMENT': 85,
 'EVENTS': 63,
 'FINANCE': 328,
 'FOOD_AND_DRINK': 110,
 'HEALTH_AND_FITNESS': 273,
 'HOUSE_AND_HOME': 73,
 'LIBRARIES_AND_DEMO': 83,
 'LIFESTYLE': 346,
 'GAME': 862,
 'FAMILY': 1675,
 'MEDICAL': 313,
 'SOCIAL': 236,
 'SHOPPING': 199,
 'PHOTOGRAPHY': 261,
 'SPORTS': 301,
 'TRAVEL_AND_LOCAL': 207,
 'TOOLS': 750,
 'PERSONALIZATION': 294,
 'PRODUCTIVITY': 345,
 'PARENTING': 58,
 'WEATHER': 71,
 'VIDEO_PLAYERS': 159,
 'NEWS_AND_MAGAZINES': 248,
 'MAPS_AND_NAVIGATION': 124}

In [100]:
## Loop over unique categories
for category in category_android:
    total=0 ## To count the total of each category
    len_category=0 ## To count the number of apps in each category
    for app in android_free:
        category_app=app[1]
        if category_app == category:
            n_installs=app[5]
            n_installs=n_installs.replace('+','')
            n_installs=n_installs.replace(',','')
            n_install_float=float(n_installs)
            total+=n_install_float
            len_category+=1
    ## Calculate the average number of installs
    avg_installs=round(total/len_category,2)
    print(category,':',avg_installs)
    

ART_AND_DESIGN : 1986335.09
AUTO_AND_VEHICLES : 647317.82
BEAUTY : 513151.89
BOOKS_AND_REFERENCE : 8767811.89
BUSINESS : 1712290.15
COMICS : 817657.27
COMMUNICATION : 38456119.17
DATING : 854028.83
EDUCATION : 1833495.15
ENTERTAINMENT : 11640705.88
EVENTS : 253542.22
FINANCE : 1387692.48
FOOD_AND_DRINK : 1924897.74
HEALTH_AND_FITNESS : 4188821.99
HOUSE_AND_HOME : 1331540.56
LIBRARIES_AND_DEMO : 638503.73
LIFESTYLE : 1437816.27
GAME : 15588015.6
FAMILY : 3697848.17
MEDICAL : 120550.62
SOCIAL : 23253652.13
SHOPPING : 7036877.31
PHOTOGRAPHY : 17840110.4
SPORTS : 3638640.14
TRAVEL_AND_LOCAL : 13984077.71
TOOLS : 10801391.3
PERSONALIZATION : 5201482.61
PRODUCTIVITY : 16787331.34
PARENTING : 542603.62
WEATHER : 5074486.2
VIDEO_PLAYERS : 24727872.45
NEWS_AND_MAGAZINES : 9549178.47
MAPS_AND_NAVIGATION : 4056941.77
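
The nested loop above rescans the whole dataset once per category. As an alternative sketch, the same averages can be accumulated in a single pass with two dictionaries. The function name and the `sample_rows` data are hypothetical; the column indices 1 and 5 are the ones used above:

```python
def average_installs_by_category(rows, category_index=1, installs_index=5):
    """Return {category: average installs}, reading each row only once."""
    totals = {}  # Sum of installs per category
    counts = {}  # Number of apps per category
    for row in rows:
        category = row[category_index]
        installs = float(row[installs_index].replace('+', '').replace(',', ''))
        totals[category] = totals.get(category, 0.0) + installs
        counts[category] = counts.get(category, 0) + 1
    return {c: round(totals[c] / counts[c], 2) for c in totals}


# Hypothetical rows, padded so category is at index 1 and installs at index 5
sample_rows = [
    ['App A', 'TOOLS', '', '', '', '1,000+'],
    ['App B', 'TOOLS', '', '', '', '5,000+'],
    ['App C', 'WEATHER', '', '', '', '100+'],
]

averages = average_installs_by_category(sample_rows)
# Print the categories from highest to lowest average
for category, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(category, ':', avg)
```

Sorting the result before printing makes the ranking easier to read than the dictionary order shown in the output above.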


*On average, communication apps have the most installs: 38,456,119. This number is heavily skewed upward by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 million or 500 million installs:*

In [102]:
## To check all the apps with > 100m installs
for app in android_free:
    if app[1] == 'COMMUNICATION' and (app[5]=='1,000,000,000+' or
                                      app[5]=='500,000,000+' or
                                      app[5]=='100,000,000+'):
        print(app[0],':',app[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Mess

In [105]:
## Average installs in the Communication category after removing apps with 100M+ installs
under_100M=[]

for app in android_free:
    n_installs=app[5]
    n_installs=n_installs.replace('+','')
    n_installs=n_installs.replace(',','')
    n_installs_float=float(n_installs)
    if app[1]=='COMMUNICATION' and n_installs_float<100000000:
        under_100M.append(n_installs_float)

print('Avg under 100M:', sum(under_100M)/len(under_100M))

Avg under 100M: 3603485.3884615386


*This average is more than 10 times smaller than the original average for communication apps, which is 38,456,119.*
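
As a quick check of that comparison, the ratio of the two averages can be computed directly. Both numbers are copied from the outputs above, and the variable names are illustrative only:

```python
avg_all_communication = 38456119.17      # Average over all free communication apps
avg_under_100m = 3603485.3884615386      # Average after removing apps with 100M+ installs

ratio = avg_all_communication / avg_under_100m
print('Original average is', round(ratio, 1), 'times the filtered average')
```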

*We see the same pattern for the video players category, which is the runner-up with 24,727,872 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).*

*Again, the main concern is that these app genres might seem more popular than they really are. Moreover, these niches seem to be dominated by a few giants who are hard to compete against.*

*The game genre seems pretty popular, but previously we found out this part of the market seems a bit saturated, so we'd like to come up with a different app recommendation if possible.*

*The books and reference genre looks fairly popular as well, with an average number of installs of 8,767,811. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.*

In [106]:
for app in android_free:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000+'
                                            or app[5] == '5,000,000+'
                                            or app[5] == '10,000,000+'
                                            or app[5] == '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H

*This niche seems to be dominated by software for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's probably not a good idea to build similar apps since there'll be some significant competition.*

*We also notice there are quite a few apps built around the Quran, which suggests that building an app around a popular book can be profitable. It seems that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets.*

*However, it looks like the market is already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.*

## CONCLUSION

*The market on both the Google Play Store and the App Store is dominated by a few top apps in the major categories, which are hard to compete against. The books and reference genre, however, seems to offer an opportunity on both stores. If a popular book is turned into an app with extra features such as an audio version, quizzes, and a discussion forum, that app could become successful.*