# An Analysis of Applications on Google and Apple Stores
We analyze mobile app usage to determine which types of applications attract the most users. As data analysts for a company that builds Android and iOS mobile apps, we try to identify app profiles that are profitable on the Google Play and Apple App Store markets.
We are only interested in free apps, and our main source of revenue is in-app ads. This means that our revenue for any given app is mostly influenced by the number of people who use it. Our goal for this project is to analyze data to help our developers understand what kinds of apps are likely to attract more users.
**We also aim to keep the analysis reusable, so that the same steps can be applied to other datasets if needed. We do so by defining functions at each step instead of writing separate code for each dataset. We also try to avoid manual "counters" inside "for loops" and use "dictionaries" for counting wherever possible.**

### Downloading, Opening and Exploring Datasets

1. First we download two datasets with details of [android](https://www.kaggle.com/lava18/google-play-store-apps) and [ios](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) apps. 
 Both data sets are downloaded and saved [here](http://localhost:8888/tree/Projects%20and%20Data%20sets) 

2. We carry out a preliminary data assessment by exploring the two downloaded datasets. We do this by defining a function which takes as input a dataset (as a "list of lists", i.e. a table) and a desired range of rows, and prints the rows in that range as well as the number of rows and columns in the dataset.

       
3. The documentation of each downloaded dataset describes its features (columns) in detail: [android](https://www.kaggle.com/lava18/google-play-store-apps) and [ios](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

4. Both datasets are opened with the `open()` function and read with `reader()`, which is imported from the `csv` module. The headers, i.e. the column headings, are assigned to separate variables `apple_header=appsdata_apple[0]` and `google_header=appsdata_google[0]`, and the datasets without headers are also assigned to separate variables: `dataset_apple=appsdata_apple[1:]` and `dataset_google=appsdata_google[1:]`. The `explore_dataset()` and `print()` functions are then used to print the headers, the first five rows, and the number of rows and columns in each dataset.
                   

## Preliminary Exploration of Datasets
We carry out a preliminary data assessment by exploring the two downloaded datasets. We do this by defining a function `explore_dataset(dataset,start,end,rows_and_columns=False)` which takes as input a dataset (as a "list of lists", i.e. a table) and a desired range of rows, and prints the rows in that range as well as, optionally, the number of rows and columns in the dataset:

In [1]:
def explore_dataset(dataset,start,end,rows_and_columns=False):
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') #adds two empty lines after each row
    if rows_and_columns:
        print('Number of rows',len(dataset))
        print('Number of Columns',len(dataset[0]))

## Opening the Datasets
Both datasets are opened with the `open()` function and read with `reader()`, which is imported from the `csv` module. The headers, i.e. the column headings, are assigned to separate variables `apple_header=appsdata_apple[0]` and `google_header=appsdata_google[0]`, and the datasets without headers are also assigned to separate variables: `dataset_apple=appsdata_apple[1:]` and `dataset_google=appsdata_google[1:]`. The `explore_dataset()` and `print()` functions are then used to print the headers, the first five rows, and the number of rows and columns in each dataset.

In [2]:
from csv import reader

opened_file_apple=open('AppleStore.csv',encoding="utf8")
read_file_apple=reader(opened_file_apple)
appsdata_apple=list(read_file_apple)
opened_file_apple.close() #the whole file is now held in a list, so the file can be closed
apple_header=appsdata_apple[0]
dataset_apple=appsdata_apple[1:]
print('Applestore Columns:',apple_header,'\n')
explore_dataset(dataset_apple,0,5,True) #prints the rows itself and returns None, so its result is not printed again

Applestore Columns: ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows 7197
Number of Columns 

In [3]:
from csv import reader

opened_file_google=open('googleplaystore.csv',encoding="utf8")
read_file_google=reader(opened_file_google)
appsdata_google=list(read_file_google)
opened_file_google.close() #the whole file is now held in a list, so the file can be closed
google_header=appsdata_google[0]
dataset_google=appsdata_google[1:]
print('Playstore Columns:',google_header,'\n')
explore_dataset(dataset_google,0,5,True) #prints the rows itself and returns None, so its result is not printed again

Playstore Columns: ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+'
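Since the same open/read/split steps are repeated for both files, they could be wrapped in one helper, in keeping with the aim of defining a function at each step. The following is a minimal sketch; the name `load_csv` is hypothetical and not part of the original notebook, and it assumes the first row of the file is the header.

```python
from csv import reader

def load_csv(path, encoding="utf8"):
    """Read a CSV file and return (header, rows) as lists.

    Hypothetical helper: assumes the first row of the file is the header.
    """
    # The with-block closes the file even if reading raises an error.
    with open(path, encoding=encoding, newline="") as opened_file:
        table = list(reader(opened_file))
    return table[0], table[1:]

# Possible usage with the files from this project:
# apple_header, dataset_apple = load_csv('AppleStore.csv')
# google_header, dataset_google = load_csv('googleplaystore.csv')
```

Returning the header and the rows as a pair keeps the two pieces together while still allowing them to be assigned to separate variables, as done in the cells above.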

## Short Length Entries
The first step in data cleaning is to identify wrong entries, i.e. data that was not entered correctly.

1. A common type of "wrong entry" is a short row, i.e. a row of data in which some column was left out. First, we search the discussion forums for reports of such entries for both [apple](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion) and [google](https://www.kaggle.com/lava18/google-play-store-apps/discussion). We find one such report for the [dataset_google](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015). We check it using the `print()` and `len()` functions for row 10472 and a few adjoining rows, and find that it is indeed the case: row 10472 has 12 columns instead of the 13 found in the header and in the other rows that we checked. We delete this row using the `del` statement and check the length of the dataset before and after deletion using `len()`.
2. We then check both datasets completely to identify any further short entries. For this we write a function `short_length_entries(dataset,n,dataset_header)`, where n is the index of the column in which the name of the application appears (0 for google, and 1 for apple). We run this function on both datasets before and after the delete operation described above, and find that row 10472 was the only short entry in either dataset.
 


In [4]:
print(google_header,'\n')
print(dataset_google[0],'\n')
print(dataset_google[10472])
print(len(dataset_google[10470]),len(dataset_google[10471]),len(dataset_google[10472]),len(dataset_google[10473]))

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
13 13 12 13


In [5]:
def short_length_entries(dataset,n,dataset_header):
    short_length_entries_count={}
    short_length_list=[]
    short_length_index=[]
    
    #enumerate gives the true position of each row; dataset.index(row) would only return the first matching row
    for position,row in enumerate(dataset):
        if len(row)<len(dataset_header):
            short_length_list.append(row[n])
            short_length_index.append(position)
    for name_of_app in short_length_list:
        if name_of_app in short_length_entries_count:
            short_length_entries_count[name_of_app]+=1
        else:
            short_length_entries_count[name_of_app]=1
    return len(short_length_list),short_length_entries_count,short_length_index
    
print(short_length_entries(dataset_google,0, google_header))
print(short_length_entries(dataset_apple,1, apple_header))

(1, {'Life Made WI-Fi Touchscreen Photo Frame': 1}, [10472])
(0, {}, [])


In [6]:
print (len(dataset_google))
del dataset_google[10472]
print(len(dataset_google))

10841
10840


In [7]:
  
print(short_length_entries(dataset_google,0, google_header))
print(short_length_entries(dataset_apple,1, apple_header))

(0, {}, [])
(0, {}, [])
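The check above looks only for rows that are too short. A row could also be too long, for example if a field were split in two. The sketch below, with the hypothetical name `mismatched_length_rows` and toy data, reports every row whose length differs from the header's in either direction.

```python
def mismatched_length_rows(dataset, header):
    """Return (index, row_length) for every row whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data for illustration only (not taken from the real datasets).
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['A', 'TOOLS', '4.1'],
    ['B', '3.9'],                     # one column missing
    ['C', 'GAME', '4.5', 'extra'],    # one column too many
]
print(mismatched_length_rows(toy_rows, toy_header))  # [(1, 2), (2, 4)]
```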


## Duplicate Entries
The next step in data cleaning is the identification of duplicate entries and exploration of the reasons for duplication. We define a function `duplicate_entries_count(dataset,n)` in which n is the index of the column in the dataset where the name of the application appears. The instances of duplication are assigned to a dictionary named `duplicate_entries_freq`. We learn that there are 2 instances of duplication in the apple dataset and 1181 instances in the google dataset.

In [8]:
def duplicate_entries_count(dataset,n):
    duplicate_entries=[]
    unique_entries=[]
    duplicate_entries_freq={}
    duplicate_entries_index=[]
    for index,row in enumerate(dataset): #enumerate gives the true position of each row, even for identical rows
        name=row[n]
        if name in unique_entries:
            duplicate_entries.append(name)
            duplicate_entries_index.append(index)
        else:
            unique_entries.append(name)
        
    for name in duplicate_entries:    
        if name in duplicate_entries_freq:
            duplicate_entries_freq[name]+=1
        else:
            duplicate_entries_freq[name]=2
    print('length of duplicate entries:',len(duplicate_entries),'length of unique entries:', len(unique_entries))
    print(len(duplicate_entries_index))
    print(len(duplicate_entries_freq))
    print(duplicate_entries_freq)        
print('Apple Dataset Duplicate Data','\n')
duplicate_entries_count(dataset_apple,1)


print('\n','Google Dataset Duplicate Data','\n')
duplicate_entries_count(dataset_google,0)



Apple Dataset Duplicate Data 

length of duplicate entries: 2 length of unique entries: 7195
2
2
{'Mannequin Challenge': 2, 'VR Roller Coaster': 2}

 Google Dataset Duplicate Data 

length of duplicate entries: 1181 length of unique entries: 9659
1181
798
{'Quick PDF Scanner + OCR FREE': 3, 'Box': 3, 'Google My Business': 3, 'ZOOM Cloud Meetings': 2, 'join.me - Simple Meetings': 3, 'Zenefits': 2, 'Google Ads': 3, 'Slack': 3, 'FreshBooks Classic': 2, 'Insightly CRM': 2, 'QuickBooks Accounting: Invoicing & Expenses': 3, 'HipChat - Chat Built for Teams': 2, 'Xero Accounting Software': 2, 'MailChimp - Email, Marketing Automation': 2, 'Crew - Free Messaging and Scheduling': 2, 'Asana: organize team projects': 2, 'Google Analytics': 2, 'AdWords Express': 2, 'Accounting App - Zoho Books': 2, 'Invoice & Time Tracking - Zoho': 2, 'Invoice 2go — Professional Invoices and Estimates': 2, 'SignEasy | Sign and Fill PDF and other Documents': 2, 'Genius Scan - PDF Scanner': 2, 'Tiny Scanner - PDF Scan

After printing a few of the duplicated entries, we learn that in the google dataset some entries are duplicated with a different "number of reviews" (index=3), which suggests that the rows were collected at different times, for instance 'Instagram', while others are exact copies of each other, for instance 'ZOOM Cloud Meetings'. In the apple dataset, the duplicates differ in several columns, the most important difference being `size_bytes`. We have to develop criteria for removing both kinds of duplication.
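The duplicate counts can also be obtained with `collections.Counter`, which is a dictionary subclass from the standard library and so fits the preference for dictionaries over manual counters. This is a sketch on toy rows; the function name `duplicate_name_counts` is hypothetical.

```python
from collections import Counter

def duplicate_name_counts(dataset, n):
    """Map each app name that occurs more than once (in column n) to its number of occurrences."""
    counts = Counter(row[n] for row in dataset)
    return {name: count for name, count in counts.items() if count > 1}

# Toy rows for illustration only.
toy = [['Box', '1'], ['Slack', '2'], ['Box', '3'], ['Box', '4'], ['Zoom', '5'], ['Slack', '6']]
dupes = duplicate_name_counts(toy, 0)
print(dupes)                             # {'Box': 3, 'Slack': 2}
print(sum(dupes.values()) - len(dupes))  # 3 redundant rows
```

The length of the returned dictionary corresponds to the number of duplicated applications, and the sum of its values minus its length corresponds to the number of redundant rows.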


Example entry: "Instagram", in which the duplicated rows are differentiated by "number of reviews" (index=3)

In [9]:
print('\n',google_header,'\n')      
for row in dataset_google:
    name=row[0]
    if name=='Instagram':
        print(dataset_google.index(row))
        print(row)
        


 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

2545
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
2604
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
2545
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
3909
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


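Example entry: "ZOOM Cloud Meetings", in which the duplicated rows are identical. Note that `dataset_google.index(row)` returns the position of the first matching row, so identical rows are printed with the same index (213 below, and 2545 for two of the "Instagram" rows above).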

In [10]:
print('\n',google_header,'\n')      
for row in dataset_google:
    name=row[0]
    if name=='ZOOM Cloud Meetings':
        print(dataset_google.index(row))
        print(row)
        


 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

213
['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614', '37M', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 20, 2018', '4.1.28165.0716', '4.0 and up']
213
['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614', '37M', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 20, 2018', '4.1.28165.0716', '4.0 and up']
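The repeated index in the output comes from how `list.index` works: it returns the position of the first element equal to its argument. The toy example below (made-up rows, not the real dataset) contrasts it with `enumerate`, which reports the actual position of each row.

```python
# Toy rows for illustration only; rows 0 and 2 are identical.
rows = [
    ['ZOOM Cloud Meetings', '31614'],
    ['Some Other App', '100'],
    ['ZOOM Cloud Meetings', '31614'],
]

# list.index returns the first matching position for both identical rows.
via_index = [rows.index(row) for row in rows if row[0] == 'ZOOM Cloud Meetings']

# enumerate yields the true position of every row.
via_enumerate = [i for i, row in enumerate(rows) if row[0] == 'ZOOM Cloud Meetings']

print(via_index)      # [0, 0]
print(via_enumerate)  # [0, 2]
```

Besides being accurate for identical rows, `enumerate` avoids rescanning the list for every row, which matters on a dataset with more than ten thousand entries.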


Apple Dataset: the duplicated "Mannequin Challenge" and "VR Roller Coaster" entries are differentiated by "size of application" (index=2)

In [11]:
print('\n',apple_header,'\n')      
for row in dataset_apple:
    name=row[1]
    if name=='Mannequin Challenge':
        print(dataset_apple.index(row))
        print(row)
        
    if name=='VR Roller Coaster':
        print(dataset_apple.index(row))
        print(row)    


 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

2948
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
4442
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
4463
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
4831
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']


When we executed the function `duplicate_entries_count(dataset,n)` for the google dataset, we found 1181 duplicate entries and 9659 unique entries. The `print(len(duplicate_entries_freq))` command within the same function showed that *798* applications are duplicated.

We use the criterion of retaining the entry with the "maximum number of reviews" for the google dataset, and create a dictionary `reviews_max` which maps each app name to its greatest number of reviews.

In [12]:
reviews_max = {}

for row in dataset_google:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print('Length of entries with maximum reviews:',len(reviews_max))        

Length of entries with maximum reviews: 9659


When we executed the function `duplicate_entries_count(dataset,n)` for the apple dataset, we found 2 duplicate entries and 7195 unique entries. The `print(len(duplicate_entries_freq))` command within the same function showed that *2* applications are duplicated.

We use the criterion of retaining the entry with the "maximum size" for the apple dataset, and create a dictionary `size_max` which maps each app name to its greatest size.

In [13]:
size_max = {}

for row in dataset_apple:
    name = row[1]
    size = float(row[2])
    
    if name in size_max and size_max[name] < size:
        size_max[name] = size
        
    elif name not in size_max:
        size_max[name] = size
print('Length of entries with maximum size:',len(size_max))    

Length of entries with maximum size: 7195
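The two cells above differ only in the column indices they use, so they could be merged into a single function. The sketch below uses the hypothetical name `max_by_name` and toy rows; it assumes the value column can always be converted with `float()`.

```python
def max_by_name(dataset, name_index, value_index):
    """Map each app name to the largest numeric value found in column value_index."""
    best = {}
    for row in dataset:
        name = row[name_index]
        value = float(row[value_index])  # assumes the column holds numeric text
        if name not in best or value > best[name]:
            best[name] = value
    return best

# Toy rows for illustration only.
toy = [['Instagram', '66577313'], ['Instagram', '66577446'], ['Box', '159']]
print(max_by_name(toy, 0, 1))  # {'Instagram': 66577446.0, 'Box': 159.0}

# Possible usage with the datasets from this project:
# reviews_max = max_by_name(dataset_google, 0, 3)
# size_max = max_by_name(dataset_apple, 1, 2)
```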


We now create a new dataset named `google_dataset_clean` using the dictionary `reviews_max` created above. This dataset will contain only unique entries, each with the maximum number of reviews. To ensure that identical entries such as "ZOOM Cloud Meetings" are not added to the clean dataset more than once, we create another list named `already_added` and use the code:
```
if (reviews_max[name] == n_reviews) and (name not in already_added):
    google_dataset_clean.append(row)
    already_added.append(name)
```
The supplementary clause with `if` statement ensures that entries with similar content are added to `google_dataset_clean` only once. 
With the `print(len(google_dataset_clean))` command, we can see that the length of clean dataset is *"9659"* as expected.

In [14]:
google_dataset_clean=[]
already_added=[]
for row in dataset_google:
    name=row[0]
    n_reviews = float(row[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        google_dataset_clean.append(row)
        already_added.append(name)
print(len(google_dataset_clean))        
    

9659


For apple, we also create a new dataset named `apple_dataset_clean` using the dictionary `size_max` created above. This dataset will contain only unique entries, each with the maximum size. To ensure that identical entries are not added to the clean dataset more than once, we create another list named `already_added` (there are no identical entries here, but we include this step as good practice) and use the code:
```
if (size_max[name] == size) and (name not in already_added):
    apple_dataset_clean.append(row)
    already_added.append(name)
```
The supplementary clause with `if` statement ensures that entries with similar content are added to `apple_dataset_clean` only once.
With the `print(len(apple_dataset_clean))` command, we can see that the length of clean dataset is *"7195"* as expected.

In [15]:
apple_dataset_clean=[]
already_added=[]
for row in dataset_apple:
    name=row[1]
    size = float(row[2])
    if (size_max[name] == size) and (name not in already_added):
        apple_dataset_clean.append(row)
        already_added.append(name)
print(len(apple_dataset_clean))        
    

7195
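The two-step approach above (first build a dictionary of maximum values, then filter the rows) can also be written as a single pass that stores the best row per name directly. This is a sketch with the hypothetical name `remove_duplicates` and toy rows. Unlike the cells above, it returns the rows in the order in which each name first appears, not in the order of the retained rows.

```python
def remove_duplicates(dataset, name_index, value_index):
    """Keep one row per app name: the first row with the largest value in column value_index."""
    best_rows = {}
    for row in dataset:
        name = row[name_index]
        value = float(row[value_index])
        # The strict comparison keeps the earlier row when two rows tie,
        # so identical duplicates are added only once.
        if name not in best_rows or value > float(best_rows[name][value_index]):
            best_rows[name] = row
    return list(best_rows.values())

# Toy rows for illustration only.
toy = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614'],
]
clean = remove_duplicates(toy, 0, 3)
print(len(clean))  # 2
print(clean[0])    # ['Instagram', 'SOCIAL', '4.5', '66577446']
```

Looking names up in a dictionary also avoids the repeated scans of the `already_added` list, which grow slower as the list gets longer.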


## Non-English Entries
After removing duplicate entries, the next step is to remove from both datasets, i.e. `google_dataset_clean` and `apple_dataset_clean`, the applications whose names suggest that they do not cater to an English-speaking audience. We start by defining a function `english_app(name)` which loops over the application name and checks each character against the **maximum ASCII value**, i.e. **127**, using the `ord(character)` function. If the value of any character exceeds 127, the function returns `False` ("non-English app"); otherwise it returns `True` ("English app").

In [16]:
def english_app(name):
    for character in name:
        if ord(character)>127:
            return False
       
    return True
                           

We check the output against a few entries and find that some English apps with special characters outside the **ASCII range (0-127)** are being interpreted as non-English apps: the `english_app(name)` function returns `False` for them.

In [17]:
print(english_app("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(english_app("Instagram"))
print(english_app("Box播"))
print(english_app(google_dataset_clean[7940][0]))
print(english_app(google_dataset_clean[4412][0]))
print(english_app("Instachat 😜"))
print(english_app("Docs To Go™ Free Office Suite"))
print(ord('™'))
print(ord('😜'))

False
True
False
False
False
False
False
8482
128540


### Amended Function
To handle the above anomaly, we define a criterion for retaining English apps with special characters in a function `english_app_0(name)`. The function initializes a variable `non_ascii = 0` and then loops over the characters of the string 'name', adding 1 for each character outside the ASCII range. If there are more than 3 such characters in the name, the function considers the app non-English (returns `False`); otherwise it considers it English (returns `True`).

In [18]:
def english_app_0(name):
    non_ascii = 0
    
    for character in name:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(english_app_0('Docs To Go™ Free Office Suite'))
print(english_app_0('Instachat 😜'))

True
True
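The limit of 3 is a judgment call, so it may be convenient to expose it as a parameter and compare the effect of different limits. The sketch below uses the hypothetical name `is_probably_english`; with the default limit it makes the same decision as `english_app_0`.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii characters outside the ASCII range."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))   # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))      # False
print(is_probably_english('Instachat 😜', max_non_ascii=0))   # False, same as the strict english_app
```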


### Alternate Amended Function
Alternatively, we can define an 'Alternate Amended Function' `english_app_1(name)` which loops over the string name and appends the values of characters above 127 to a list `ASCII_values`. We then loop over this list to generate a frequency table `ASCII_freq`, whose length is the number of distinct non-ASCII characters in the name. The application is considered a non-English app if the length of `ASCII_freq` is 4 or more. Unlike the function above, which counts every non-ASCII character, this one counts a repeated character only once, so it retains at least as many entries for data analysis. We recheck the English app names that were considered **non-English** when we checked them above ('Instachat 😜' and 'Docs To Go™ Free Office Suite'), and find that they are now classified as **English apps**.

In [19]:
def english_app_1(name):
    ASCII_values=[]
    ASCII_freq={}
        
    for character in name:
        value=ord(character)      
        if value>127:
            ASCII_values.append(value)
    for value in ASCII_values:
        if value in ASCII_freq:
            ASCII_freq[value]+=1
        else:
            ASCII_freq[value]=1
    if len(ASCII_freq)>=4:
        return False
        
    else:
        return True 
    
    

In [20]:
print(english_app_1("爱奇艺PPS -《欢乐颂2》电视剧热播"))
print(english_app_1("Instagram"))
print(english_app_1("Box"))
print(english_app_1(google_dataset_clean[7940][0]))
print(english_app_1(google_dataset_clean[4412][0]))
print(english_app_1("Instachat 😜"))
print(english_app_1("Docs To Go™ Free Office Suite"))

False
True
True
False
False
True
True
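The difference between the two amended functions is whether repeated non-ASCII characters are counted once or every time. The sketch below, with the hypothetical name `non_ascii_stats`, returns both numbers for a name; the second app name is made up to show a case where the two functions would disagree.

```python
def non_ascii_stats(name):
    """Return (total, distinct) counts of the characters in name with ord() above 127."""
    non_ascii = [character for character in name if ord(character) > 127]
    return len(non_ascii), len(set(non_ascii))

print(non_ascii_stats('Docs To Go™ Free Office Suite'))  # (1, 1)
print(non_ascii_stats('Cool™ App™ Pro™ Max™'))           # (4, 1), a made-up name
```

For the made-up name, a rule based on the total count (more than 3) would reject it, while a rule based on the distinct count (4 or more) would keep it.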


## English Apps Dataset
We now use the alternate amended function `english_app_1` to generate two datasets, `dataset_english_google` and `dataset_english_apple`, by removing the non-English apps from `google_dataset_clean` and `apple_dataset_clean`. For this we create two empty lists and loop over the rows of `google_dataset_clean` and `apple_dataset_clean` in separate for loops. Whenever `english_app_1(name)` returns `True`, the row is appended to the corresponding list. We check the number of rows and columns and find that we now have 9619 entries for google and 6189 entries for apple.

In [21]:
dataset_english_google=[]
dataset_english_apple=[]
for row in google_dataset_clean:
    name=row[0]
    if english_app_1(name):
        dataset_english_google.append(row)

for row in apple_dataset_clean:
    name=row[1]
    if english_app_1(name):
        dataset_english_apple.append(row)
print('Google no of rows',len(dataset_english_google))
print('Google no of columns',len(dataset_english_google[0]))
print('Apple no of rows',len(dataset_english_apple)) 
print('Apple no of columns',len(dataset_english_apple[0]))

Google no of rows 9619
Google no of columns 13
Apple no of rows 6189
Apple no of columns 16


### Alternate Method (Use of Function)
We can also call the alternate amended function `english_app_1(name)` within a function `dataset_english_apps(dataset,n)` that returns the filtered dataset `dataset_english` and can be used for any dataset. The parameters of this function are **'dataset'**, which can be any dataset, and **'n'**, which is the index of the app name in the dataset. This lets us process any number of datasets without having to write additional code each time, as done above. Within `dataset_english_apps(dataset,n)` we also call the function `explore_dataset(dataset,start,end,rows_and_columns=False)` defined in the beginning. From its output we find 9619 entries (13 columns) for google and 6189 entries (16 columns) for apple (same as above).

In [22]:
def dataset_english_apps(dataset,n):
    dataset_english=[]
    for row in dataset:
        name=row[n]
        if english_app_1(name):
            dataset_english.append(row)
    explore_dataset(dataset_english,0,3,True)
    return dataset_english
    

print('Google Data')
dataset_english_google=dataset_english_apps(google_dataset_clean,0)
print('\n','Apple Data')
dataset_english_apple=dataset_english_apps(apple_dataset_clean,1)
print(len(dataset_english_google))



Google Data
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows 9619
Number of Columns 13

 Apple Data
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.

## Isolating Free Apps
We create two separate empty lists, `dataset_free_google=[]` and `dataset_free_apple=[]`, and use a separate for loop for each dataset to loop over its rows. The conditionals `if price=='0':` and `if price=='0.0':` cater for the different price formats used in the two datasets. After removing non-free applications we are left with **8869** entries for the google dataset and **3227** for apple (see alternate method below).

In [23]:
dataset_free_google=[]
dataset_free_apple=[]

for row in dataset_english_google:
    price=row[7]
    if price=='0':
        dataset_free_google.append(row)
    
    
for row in dataset_english_apple:
    price=row[4]
    if price=='0.0':
        dataset_free_apple.append(row)    
    
    
print('Google Free Apps Dataset Length',len(dataset_free_google))
print('Apple Free Apps Dataset Length',len(dataset_free_apple))


Google Free Apps Dataset Length 8869
Apple Free Apps Dataset Length 3227


### Isolating Free Apps - Alternate Method (use of function)
Since we are exploring more than one dataset, it is convenient to write a general function, which removes the need to write code separately for each dataset. We define a function `dataset_free_apps(dataset,n)` in which **n** is the index of the column that indicates whether an app is free: the "Type" column (index=6) for google and the "price" column (index=4) for apple. We create an empty list `dataset_free` and loop over the rows in the dataset. The conditional `if price=='Free' or price=='0.0'` caters for the different formats used in the two datasets. After removing non-free applications we are left with **8868** entries for the google dataset and **3227** for apple. The google count is one less than the 8869 found above, which means that at least one row has a price of '0' but a "Type" other than 'Free'.

In [24]:
def dataset_free_apps(dataset,n):
    dataset_free=[]
    for row in dataset:
        price=row[n]
        if price=='Free' or price=='0.0':
            dataset_free.append(row)
    return dataset_free 

dataset_free_google=dataset_free_apps(dataset_english_google,6)
dataset_free_apple=dataset_free_apps(dataset_english_apple,4)
print('Google Free Apps Dataset Length',len(dataset_free_google))
print('Apple Free Apps Dataset Length',len(dataset_free_apple))
        

Google Free Apps Dataset Length 8868
Apple Free Apps Dataset Length 3227
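Another way to handle the different price formats is to convert the price text to a number before comparing, so that '0' and '0.0' are treated alike. The sketch below uses the hypothetical names `is_free` and `free_rows` and toy rows. It assumes that paid prices may carry a leading '$' (an assumption about the data, not shown in the output above) and treats any text that cannot be parsed as a number as not free.

```python
def is_free(price_text):
    """Return True if the price text parses to zero; unparseable text counts as not free."""
    cleaned = price_text.replace('$', '').strip()  # assumption: paid prices may look like '$4.99'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

def free_rows(dataset, price_index):
    """Return the rows whose price column parses to zero."""
    return [row for row in dataset if is_free(row[price_index])]

# Toy rows for illustration only.
toy_google = [['A', '0'], ['B', '$4.99'], ['C', 'Everyone']]
toy_apple = [['1', 'X', '0.0'], ['2', 'Y', '1.99']]
print(len(free_rows(toy_google, 1)))  # 1
print(len(free_rows(toy_apple, 2)))   # 1
```

With this approach the same price column can be used for both datasets (index 7 for google and index 4 for apple), which avoids depending on the "Type" column.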


## Most Common Apps by Genre/Category
Since we want to determine the kinds of apps that are likely to attract more users, and in turn more ad revenue, we examine the datasets for the most common genres or categories. In the google dataset, we have columns for "category" (index=1) and "genres" (index=9), while in the apple dataset we have a column for "prime genre" (index=11). We define a function `freq_table(dataset,n)` which returns, for column n, the percentage of apps that fall under each value.

In [25]:
def freq_table(dataset,n):
    frequency_table={}
    for row in dataset:
        value=row[n]
        if value in frequency_table:
            frequency_table[value]+=1
        else:
            frequency_table[value]=1
    frequency_table_percentage={}
    for key in frequency_table:
        percentage=(frequency_table[key]/len(dataset))*100
        frequency_table_percentage[key]=percentage
    return frequency_table_percentage
freq_table_percentage_google_cat=freq_table(dataset_free_google,1)
print(len(freq_table_percentage_google_cat))
freq_table_percentage_google_gen=freq_table(dataset_free_google,9)
print(len(freq_table_percentage_google_gen))
freq_table_percentage_apple_gen=freq_table(dataset_free_apple,11)
print(len(freq_table_percentage_apple_gen))


33
114
23


### Display and Sorting in Descending Order

In [26]:
def display_table(dataset,index):
    freq_table_percentage=freq_table(dataset,index) #a function previously defined which returns frequency_table_percentage
    table_display=[]
    for key in freq_table_percentage:
        key_val_as_tuple=(freq_table_percentage[key],key)
        table_display.append(key_val_as_tuple)
    table_sorted=sorted(table_display,reverse=True) 
    for entry in table_sorted:
        print(entry[1],':',entry[0])
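Instead of swapping keys and values into tuples, the sort can be done directly on the dictionary items with a `key` function. The sketch below uses the hypothetical name `sorted_percentages` and a toy table with rounded values.

```python
def sorted_percentages(table):
    """Return (key, percentage) pairs sorted by percentage, largest first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

# Toy frequency table for illustration only.
toy_table = {'Games': 58.1, 'Education': 3.7, 'Entertainment': 7.9}
for genre, percentage in sorted_percentages(toy_table):
    print(f'{genre} : {percentage:.2f}')
# Games : 58.10
# Entertainment : 7.90
# Education : 3.70
```

One difference from the tuple approach: when two entries have exactly the same percentage, this version keeps them in their original dictionary order, whereas sorting `(percentage, key)` tuples in reverse orders such ties by key in descending order.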

In [27]:
display_table(dataset_free_apple,11) #Prime Genre

Games : 58.13449023861171
Entertainment : 7.871087697551905
Photo & Video : 4.9581654787728535
Education : 3.65664704059498
Social Networking : 3.284784629687016
Shopping : 2.6030368763557483
Utilities : 2.5100712736287574
Sports : 2.138208862720793
Music : 2.045243259993802
Health & Fitness : 2.014254725751472
Productivity : 1.735357917570499
Lifestyle : 1.6114037806011776
News : 1.3325069724202045
Travel : 1.270529903935544
Finance : 1.1155872327238923
Weather : 0.8676789587852495
Food & Drink : 0.8366904245429191
Reference : 0.5577936163619461
Business : 0.5268050821196157
Book : 0.43383947939262474
Navigation : 0.18593120545398203
Medical : 0.18593120545398203
Catalogs : 0.12395413696932135


### Analysis of Apple "Prime Genre" Data
There are **23 "prime genres"**. Approximately 58% of the **3227 free apps analyzed** are games. Entertainment apps account for close to 8%, photo and video apps close to 5%, education 3.66%, and social networking 3.28% of the apps in our dataset. Shopping, Utilities, Sports, Music and Health & Fitness each fall between 2% and 3%. Productivity, Lifestyle, News, Travel and Finance each fall between 1% and 2%. We will return to this data later when analyzing genre popularity.

### Google Apps Analysis

In [28]:
display_table(dataset_free_google,1) #Category

FAMILY : 18.910690121786196
GAME : 9.720342805593143
TOOLS : 8.457374830852503
BUSINESS : 4.589535408209292
LIFESTYLE : 3.9016689219666216
PRODUCTIVITY : 3.8903924221921518
FINANCE : 3.6986919260261617
MEDICAL : 3.5295444294091114
SPORTS : 3.394226432115471
PERSONALIZATION : 3.3265674334686515
COMMUNICATION : 3.236355435272891
HEALTH_AND_FITNESS : 3.078484438430311
PHOTOGRAPHY : 2.943166441136671
NEWS_AND_MAGAZINES : 2.796571944068561
SOCIAL : 2.661253946774921
TRAVEL_AND_LOCAL : 2.334235453315291
SHOPPING : 2.244023455119531
BOOKS_AND_REFERENCE : 2.153811456923771
DATING : 1.8606224627875507
VIDEO_PLAYERS : 1.7929634641407306
MAPS_AND_NAVIGATION : 1.4095624718087505
FOOD_AND_DRINK : 1.2404149751917006
EDUCATION : 1.1614794767704104
ENTERTAINMENT : 0.9585024808299503
LIBRARIES_AND_DEMO : 0.9359494812810103
AUTO_AND_VEHICLES : 0.9246729815065404
HOUSE_AND_HOME : 0.8231844835363104
WEATHER : 0.8006314839873704
EVENTS : 0.7104194857916103
PARENTING : 0.6540369869192603
ART_AND_DESIGN : 0.

### Analysis of Google "Category" Data
There are **33 "Categories"**. Approximately 19% of the **8868 free apps analyzed** belong to the Family category, followed by Games at nearly 10% and Tools at about 8.5%. Health & Fitness, Communication, Medical and Business are between 3% and 5%. Books & Reference, Shopping and Travel & Local are between 2% and 3%. Education, Food & Drink, Maps & Navigation and Dating are between 1% and 2%. **Categories** are how app data is organized on the Play Store, which gives users easy access.
![image.png](attachment:image.png)

### Google "Genres" Data

In [29]:
display_table(dataset_free_google,9) #Genre

Tools : 8.446098331078034
Entertainment : 6.078033378439333
Education : 5.356337392873252
Business : 4.589535408209292
Productivity : 3.8903924221921518
Lifestyle : 3.8903924221921518
Finance : 3.6986919260261617
Medical : 3.5295444294091114
Sports : 3.4618854307622913
Personalization : 3.3265674334686515
Communication : 3.236355435272891
Action : 3.1010374379792514
Health & Fitness : 3.078484438430311
Photography : 2.943166441136671
News & Magazines : 2.796571944068561
Social : 2.661253946774921
Travel & Local : 2.322958953540821
Shopping : 2.244023455119531
Books & Reference : 2.153811456923771
Simulation : 2.041046459179071
Dating : 1.8606224627875507
Arcade : 1.8493459630130809
Video Players & Editors : 1.7704104645917909
Casual : 1.7591339648173208
Maps & Navigation : 1.4095624718087505
Food & Drink : 1.2404149751917006
Puzzle : 1.1276499774470004
Racing : 0.9923319801533603
Role Playing : 0.9359494812810103
Libraries & Demo : 0.9359494812810103
Auto & Vehicles : 0.924672981506540

There are **114 "Genres"**, which appear to be a more detailed classification introduced during data compilation. They are dominated by Tools (8.4%), Entertainment (6.1%), Education (5.4%), and Business (4.6%). Since the basis of this classification is very detailed and not explained, we will not use it further and will focus on popular apps by the "Category" classification provided on the Play Store.


## Most Popular Apps by Genre/Category
The number of applications in a genre or category does not by itself determine that genre's popularity, since those apps may not be used by a significant number of users. To find out which genres are the most popular (have the most users), we can calculate the average number of installs per app for each genre. For the Google Play dataset, this information is in the Installs column, but for the App Store dataset it is missing. There we will use the total number of user ratings as a proxy instead.
We reuse the dictionary-counting pattern of the freq_table(dataset,n) function defined earlier to determine the average user rating count per genre in the App Store and the average installs per category in the Google Play Store.

### Most Popular Genres - Apple

In [30]:
rating_count_sum={}
genre_len={}

for row in dataset_free_apple:
    genre_app=row[11]
    rating_count=float(row[5])
    if genre_app in rating_count_sum:
        rating_count_sum[genre_app]+=rating_count
    else:
        rating_count_sum[genre_app]=rating_count
    if genre_app in genre_len:
        genre_len[genre_app]+=1
    else:
        genre_len[genre_app]=1
for genre_app in rating_count_sum:         
    avg_rating_genre=rating_count_sum[genre_app]/genre_len[genre_app]
    print(genre_app,':',avg_rating_genre)


Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22764.26172707889
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 27581.829268292684
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16168.73076923077
Entertainment : 14029.830708661417
Food & Drink : 32099.51851851852
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


In [31]:
def avg_rating(dataset,n1,n2): #n1=index of genre/category; n2=index of rating count or installs
    rating_count_sum={}
    genre_len={}

    for row in dataset:
        genre_app=row[n1]
        rating_count=float(row[n2])
        if genre_app in rating_count_sum:
            rating_count_sum[genre_app]+=rating_count
        else:
            rating_count_sum[genre_app]=rating_count
        if genre_app in genre_len:
            genre_len[genre_app]+=1
        else:
            genre_len[genre_app]=1
    for genre_app in rating_count_sum:         
        avg_rating_genre=rating_count_sum[genre_app]/genre_len[genre_app]
        print(genre_app,':',avg_rating_genre)
    print(genre_len)    
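
A variant that returns the averages instead of printing them makes the result easier to sort and reuse. This is a minimal sketch under assumptions: `avg_by_group` and `sample_rows` are hypothetical names, and the toy rows are not real data.

```python
def avg_by_group(dataset, group_index, value_index):
    """Return {group: average of the numeric column} for each group."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


# Hypothetical toy rows in the form [name, genre, rating_count]
sample_rows = [
    ["App A", "Weather", "100"],
    ["App B", "Weather", "300"],
    ["App C", "Travel", "50"],
]
averages = avg_by_group(sample_rows, 1, 2)
# Print the groups from highest to lowest average
for group, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(group, ':', avg)
```

Returning a dictionary keeps computation and display separate, so the same result can be printed, sorted, or compared across stores.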
        


In [32]:
avg_rating(dataset_free_apple,11,5)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22764.26172707889
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 27581.829268292684
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16168.73076923077
Entertainment : 14029.830708661417
Food & Drink : 32099.51851851852
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0
{'Social Networking': 106, 'Photo & Video': 160, 'Games': 1876, 'Music': 66, 'Reference': 18, 'Health & Fitness': 65, 'Weather': 28, 'Utilities': 81, 'Travel': 41, 'Shopping': 84, 'News': 43, 'Navigation': 6, 'Lifestyle': 52, 'Entertainment': 254, 'Food & Drink': 27, 'Sports': 69, 'Book': 14, 'Finance': 36, 'Education': 118, 'Productivity'

Comparing the data above with our own objectives leads us to some preliminary conclusions. First, we cannot develop apps in categories that require the provision or exchange of commodities, such as "Food & Drink", "Shopping" and "Health & Fitness", or in specialty areas such as "Music", "Productivity", "Medical" and "Navigation". Some categories, such as "Social Networking", "Games" and "Entertainment", are occupied by corporate giants with whom we cannot compete as a startup. Some categories are of low interest to the general public and can also be discarded, such as "Catalogs" and "Business". This leaves us with three very interesting categories to explore: "Travel", "Finance" and "Weather".

In [33]:
for row in dataset_free_apple:
    genre_app=row[11]
    app_name=row[1]
    rating_count=row[5]
    if genre_app=="Travel":
        print(app_name, rating_count)

Google Earth 446185
Yelp - Nearby Restaurants, Shopping & Services 223885
GasBuddy 145549
TripAdvisor Hotels Flights Restaurants 56194
Uber 49466
Lyft 46922
HotelTonight - Great Deals on Last Minute Hotels 32341
Hotels & Vacation Rentals by Booking.com 31261
Southwest Airlines 30552
Airbnb 22302
Expedia Hotels, Flights & Vacation Package Deals 10278
Fly Delta 8094
Hopper - Predict, Watch & Book Flights 6944
United Airlines 5748
Skiplagged — Actually Cheap Flights & Hotels 1851
Viator Tours & Activities 1839
iExit Interstate Exit Guide 1798
Gogo Entertainment 1482
Google Street View 1450
滴滴出行 1103
Webcams – EarthCam 912
HISTORY Here 685
DB Navigator 512
Mobike - Dockless Bike Share 494
MiFlight™ – Airport security line wait times at checkpoints for domestic and international travelers 493
BlaBlaCar - Trusted Carpooling 397
Six Flags 353
Google Trips – Travel planner 329
Voyages-sncf.com : book train and bus tickets 268
Trainline UK: Live Train Times, Tickets & Planner 248
Urlaubspiraten

In [34]:
for row in dataset_free_apple:
    genre_app=row[11]
    app_name=row[1]
    rating_count=row[5]
    if genre_app=="Finance":
        print(app_name, rating_count)    

Chase Mobile℠ 233270
Mint: Personal Finance, Budget, Bills & Money 232940
Bank of America - Mobile Banking 119773
PayPal - Send and request money safely 119487
Credit Karma: Free Credit Scores, Reports & Alerts 101679
Capital One Mobile 56110
Citi Mobile® 48822
Wells Fargo Mobile 43064
Chase Mobile 34322
Square Cash - Send Money for Free 23775
Capital One for iPad 21858
Venmo 21090
USAA Mobile 19946
TaxCaster – Free tax refund calculator 17516
Amex Mobile 11421
TurboTax Tax Return App - File 2016 income taxes 9635
Bank of America - Mobile Banking for iPad 7569
Wells Fargo for iPad 2207
Stash Invest: Investing & Financial Education 1655
Digit: Save Money Without Thinking About It 1506
IRS2Go 1329
Capital One CreditWise - Credit score and report 1019
U by BB&T 790
Paribus - Rebates When Prices Drop 768
KeyBank Mobile 623
VyStar Mobile Banking for iPhone 434
Sparkasse - Your mobile branch 77
VyStar Mobile Banking for iPad 57
Zaim 44
Ma Banque 17
Lloyds Bank Mobile Banking 17
Suica 10
Hali

In [35]:
for row in dataset_free_apple:
    genre_app=row[11]
    app_name=row[1]
    rating_count=row[5]
    if genre_app=="Weather":
        print(app_name, rating_count)

The Weather Channel: Forecast, Radar & Alerts 495626
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking 208648
WeatherBug - Local Weather, Radar, Maps, Alerts 188583
MyRadar NOAA Weather Radar Forecast 150158
AccuWeather - Weather for Life 144214
Yahoo Weather 112603
Weather Underground: Custom Forecast & Local Radar 49192
NOAA Weather Radar - Weather Forecast & HD Radar 45696
Weather Live Free - Weather Forecast & Alerts 35702
Storm Radar 22792
QuakeFeed Earthquake Map, Alerts, and News 6081
Moji Weather - Free Weather Forecast 2333
Hurricane by American Red Cross 1158
Forecast Bar 375
Hurricane Tracker WESH 2 Orlando, Central Florida 203
FEMA 128
iWeather - World weather forecast 80
Weather - Radar - Storm with Morecast App 78
Yurekuru Call 53
Weather & Radar 37
WRAL Weather Alert 25
Météo-France 24
JaxReady 22
Freddy the Frogcaster's Weather Station 14
Almanac Long-Range Weather Forecast 12
TodayAir 0
wetter.com 0
WarnWetter 0
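
The three near-identical loops above could be folded into one reusable helper, in line with the aim of defining a function for each step. The following is only a sketch: `apps_in_group`, its parameters, and `sample_rows` are hypothetical and not part of the notebook.

```python
def apps_in_group(dataset, group_index, name_index, value_index, group, top_n=None):
    """Return (name, value) pairs for one group, largest value first."""
    matches = [
        (row[name_index], float(row[value_index]))
        for row in dataset
        if row[group_index] == group
    ]
    matches.sort(key=lambda pair: pair[1], reverse=True)
    # top_n=None keeps every match; otherwise keep only the first top_n
    return matches if top_n is None else matches[:top_n]


# Hypothetical toy rows in the form [name, genre, rating_count]
sample_rows = [
    ["App A", "Weather", "25"],
    ["App B", "Travel", "900"],
    ["App C", "Weather", "4000"],
    ["App D", "Weather", "310"],
]
for name, count in apps_in_group(sample_rows, 1, 0, 2, "Weather", top_n=2):
    print(name, count)
```

With the column layout used in this notebook (genre at index 11, name at index 1, rating count at index 5), a call would look like `apps_in_group(dataset_free_apple, 11, 1, 5, "Weather")`.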


### Preliminary Conclusions
"Travel" has an average of over 27,000 user ratings per app, "Finance" over 31,000, and "Weather" over 52,000, all of which are promising. The "Travel" average is skewed by Google Earth, which is not really a travel app. Similarly, Uber and Lyft are ride-hailing services and not really travel apps. This leaves the category fairly open, with a few main players such as TripAdvisor, Booking.com, Airbnb and Yelp. This category can be exploited if we work together with travel leaders such as "Lonely Planet" to turn their guides into apps for specific countries. The "Finance" category is dominated by banking apps and can be exploited by building a home accounting app that automates much of the accounting work at household level. Weather apps can be upgraded, using an existing model, to incorporate environmental services such as air quality, biohazards and pollen levels in combination with routine weather services. Let us see if these preliminary conclusions are also supported by the Google data.

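One way to check how strongly a single app skews a genre average is to compare the mean with the median and to recompute the mean without the largest value. The sketch below uses only the six largest rating counts copied from the Travel listing above, not the full category, so the numbers it prints are illustrative; `mean_and_median` is a hypothetical helper.

```python
import statistics


def mean_and_median(values):
    """Return (mean, median) of a list of numbers."""
    return statistics.mean(values), statistics.median(values)


# Six largest Travel rating counts from the listing above (a partial sample)
travel_sample = [446185, 223885, 145549, 56194, 49466, 46922]

mean_all, median_all = mean_and_median(travel_sample)
# Drop the single largest value (Google Earth in the listing) and recompute
mean_trimmed, _ = mean_and_median(sorted(travel_sample)[:-1])

print('mean:', mean_all)
print('median:', median_all)
print('mean without largest:', mean_trimmed)
```

A mean well above the median suggests that a few heavily rated apps dominate the average, which is the pattern described for Google Earth.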

### Most Popular Genres - Google
We will only look at the Category data compared against the number of installs.

In [36]:
display_table(dataset_free_google, 5)

1,000,000+ : 15.719440685611186
100,000+ : 11.558412268831754
10,000,000+ : 10.543527289129456
10,000+ : 10.205232295895355
1,000+ : 8.412268831754623
100+ : 6.9237708615245825
5,000,000+ : 6.822282363554352
500,000+ : 5.559314388813712
50,000+ : 4.769959404600812
5,000+ : 4.510599909788001
10+ : 3.5408209291835817
500+ : 3.247631935047361
50,000,000+ : 2.3004059539918806
100,000,000+ : 2.1312584573748308
50+ : 1.9170049616599005
5+ : 0.7893549842129003
1+ : 0.5074424898511503
500,000,000+ : 0.2706359945872801
1,000,000,000+ : 0.2255299954894001
0+ : 0.04510599909788002


To compute the average number of installs for each category, each install value must first be stripped of commas and plus signs so that it can be converted to a float. We define another function, dataset_float_instal(dataset,n) (n=index of installs/ratings), and reuse the previously defined function avg_rating(dataset,n1,n2) to determine the average number of installs for each category.
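
The string cleaning described above can be sketched as a small helper that leaves the original rows untouched. `parse_installs` is a hypothetical name, and the sample strings follow the bucket format shown in the install frequency table.

```python
def parse_installs(text):
    """Turn an install bucket such as '1,000,000+' into a float."""
    return float(text.replace(',', '').replace('+', ''))


for raw in ["1,000,000+", "500+", "0+"]:
    print(raw, '->', parse_installs(raw))
```

Since each bucket is open-ended (the trailing `+`), the parsed number is best read as a lower bound on installs rather than an exact count.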

In [46]:
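def dataset_float_instal(dataset,n): #n=index of installs/ratings
    for row in dataset:
        row[n]=row[n].replace(',','') #remove thousands separators
        row[n]=row[n].replace('+','') #remove the trailing plus sign
        float(row[n]) #check only: raises ValueError if the cleaned string is not numeric; the stored value stays a string
    return dataset #same list object, cleaned in place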

dataset_float_instal_google=dataset_float_instal(dataset_free_google,5)    

avg_rating(dataset_float_instal_google,1,5)
    
 

      

    
        




ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8721912.356020942
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3693444.65712582
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5183850.806779661
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_MAG

"Travel" has an average of over 13 million installs across 207 apps, "Finance" over 1.3 million across 328 apps, and "Weather" over 5 million across 71 apps. Let us examine these categories in detail.

In [49]:
for row in dataset_float_instal_google:
    category_app=row[1]
    app_name=row[0]
    instals=row[5]
    if category_app=="TRAVEL_AND_LOCAL":
        print(app_name, instals)

trivago: Hotels & Travel 50000000
Hopper - Watch & Book Flights 5000000
TripIt: Travel Organizer 1000000
Trip by Skyscanner - City & Travel Guide 500000
CityMaps2Go Plan Trips Travel Guide Offline Maps 1000000
KAYAK Flights, Hotels & Cars 10000000
World Travel Guide by Triposo 500000
Booking.com Travel Deals 100000000
Hostelworld: Hostels & Cheap Hotels Travel App 1000000
Google Trips - Travel Planner 5000000
GPS Map Free 5000000
GasBuddy: Find Cheap Gas 10000000
Southwest Airlines 5000000
AT&T Navigator: Maps, Traffic 10000000
VZ Navigator 50000000
KakaoMap - Map / Navigation 10000000
AirAsia 10000000
Expedia Hotels, Flights & Car Rental Travel Deals 10000000
Goibibo - Flight Hotel Bus Car IRCTC Booking App 10000000
Allegiant 1000000
Amtrak 1000000
JAL (Domestic and international flights) 1000000
Flight & Hotel Booking App - ixigo 5000000
VZ Navigator for Tablets 500000
TripAdvisor Hotels Flights Restaurants Attractions 100000000
HSL - Tickets, route planner and information 100000
Wis

### Conclusion
The "Travel" category is a jumble of hotel booking, airline reservation and mapping apps. There are only a handful of "travel guides", so this remains a promising area.

In [52]:
for row in dataset_float_instal_google:
    category_app=row[1]
    app_name=row[0]
    instals=row[5]
    if category_app=="FINANCE":
        print(app_name, instals)

K PLUS 10000000
ING Banking 1000000
Citibanamex Movil 5000000
The postal bank 5000000
KTB Netbank 5000000
Mobile Bancomer 10000000
Nedbank Money 500000
SCB EASY 5000000
CASHIER 10000000
Rabo Banking 1000000
Capitec Remote Banking 1000000
Itau bank 10000000
Nubank 5000000
The Societe Generale App 1000000
IKO 1000000
Cash App 10000000
Standard Bank / Stanbic Bank 1000000
Bualuang mBanking 5000000
Intesa Sanpaolo Mobile 1000000
UBA Mobile Banking 1000000
BBVA Spain 5000000
MyMo by GSB 1000000
VTB-Online 5000000
Ecobank Mobile Banking 1000000
Banorte Movil 1000000
Zenith Bank Mobile App 1000000
GCash - Buy Load, Pay Bills, Send Money 1000000
Post Bank 1000000
İşCep 10000000
People's Bank 1000000
Transfer 5000000
T-Mobile in 1000000
TrueMoney Wallet 5000000
Alfa-Bank (Alfa-Bank) 1000000
Bank of Brazil 10000000
WiseBanyan - Invest For Free 10000
Robinhood - Investing, No Fees 1000000
Wells Fargo Daily Change 100000
Even - organize your money, get paid early 100000
Digit Save Money Automatica

### Conclusion
In Finance, most applications are for mobile banking, and there are virtually no personal accounting apps. This also remains a promising area.

In [53]:
for row in dataset_float_instal_google:
    category_app=row[1]
    app_name=row[0]
    instals=row[5]
    if category_app=="WEATHER":
        print(app_name, instals)

The Weather Channel: Rain Forecast & Storm Alerts 50000000
Weather forecast 1000000
AccuWeather: Daily Forecast & Live Weather Reports 50000000
Live Weather Pro 10000
Weather by WeatherBug: Forecast, Radar & Alerts 10000000
weather - weather forecast 1000000
MyRadar NOAA Weather Radar 10000000
SMHI Weather 1000000
Free live weather on screen 1000000
Weather Radar Widget 1000000
Weather –Simple weather forecast 10000000
Weather Crave 5000000
Klara weather 500000
Yahoo Weather 10000000
Real time Weather Forecast 1000000
METEO FRANCE 5000000
APE Weather ( Live Forecast) 5000000
Live Weather & Daily Local Weather Forecast 1000000
Weather 10000000
Rainfall radar - weather 5000000
Yahoo! Weather for SH Forecast for understanding the approach of rain clouds Free 1000000
The Weather Network 5000000
Klart.se - Sweden's best weather 1000000
GO Weather - Widget, Theme, Wallpaper, Efficient 50000000
Info BMKG 1000000
Weather From DMI/YR 100000
wetter.com - Weather and Radar 10000000
Storm Radar: T

### Conclusion
In Weather, most applications provide routine forecasts, with very few environmental variables related to human health and well-being built in. So, we can also explore this area for building a hybrid app.