# Profitable Free App Profiles on the App Store and Google Play

In this project we aim to find the main characteristics of the most profitable apps on the App Store and Google Play.
For apps that are free to download and install, the main source of revenue is in-app ads, which means that the number of users directly influences the profit.

We are going to analyze the data to find out which kinds of apps are most appealing to users and therefore attract the highest number of users.

# Previewing the Data

Due to the large number of apps on both the App Store and Google Play, we are going to use sample data that is already available for free, which saves the time and resources needed to collect new data ourselves.

The following two datasets are suitable for this project:

- [Apple iOS App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps): contains data about roughly 7,200 apps on the App Store. The dataset can be downloaded [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/download).
- [Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps): contains roughly 10,800 entries covering about 9,660 unique apps on the Google Play Store. The dataset can be downloaded [here](https://www.kaggle.com/lava18/google-play-store-apps/download).

Both datasets can be downloaded locally as CSV files.

First, we open the two CSV files and store the data as lists of lists with the help of the `reader()` function, then assign the header row and the data rows of each dataset to separate variables:

In [1]:
from csv import reader

# Absolute path of the downloaded App Store dataset
with open("/Users/abdallarashwan/Documents/Python Projects/Datasets/AppleStore/AppleStore.csv", encoding="utf8") as opened_file:
    apple = list(reader(opened_file))
apple_header = apple[0]
apple_data = apple[1:]

# Absolute path of the downloaded Google Play dataset
with open("/Users/abdallarashwan/Documents/Python Projects/Datasets/GooglePlay/googleplaystore.csv", encoding="utf8") as opened_file:
    google = list(reader(opened_file))
google_header = google[0]
google_data = google[1:]
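
To make the shape of the loaded data concrete, here is a minimal sketch that runs `reader()` on a two-line in-memory sample instead of the downloaded files. The `sample_text` string is an illustrative stand-in (one Google Play row trimmed to four columns), and `io.StringIO` simply plays the role of the opened file.

```python
from csv import reader
from io import StringIO

# Illustrative two-line sample: a header row plus one data row (four columns only)
sample_text = (
    'App,Category,Rating,Reviews\n'
    '"Photo Editor & Candy Camera & Grid & ScrapBook",ART_AND_DESIGN,4.1,159\n'
)

sample = list(reader(StringIO(sample_text)))
sample_header = sample[0]   # first row holds the column names
sample_data = sample[1:]    # remaining rows hold the apps

print(sample_header)
print(sample_data)

# Every field is read as a string, so numeric columns need explicit conversion
rating = float(sample_data[0][2])
```

Because every value comes back as a string, the later steps convert columns with `float()` before comparing them.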


Next, we define a simple function that helps us explore our data in a more efficient and reliable way.

The function `explore_dataset()` takes the following arguments:
- dataset: the list of lists containing the data we want to explore.
- start_index: index of the first row we want to review.
- end_index: index at which to stop; the row at this index is not printed (the slice is end-exclusive).
- rows_and_columns: boolean that defaults to False; if True, the total number of rows and columns is printed.

In [2]:
def explore_dataset(dataset, start_index, end_index, rows_and_columns = False):
    dataset_section = dataset[start_index:end_index]
    for row in dataset_section:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print("Number of rows: ",len(dataset))
        print("Number of columns: ",len(dataset[0]))
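
Because `explore_dataset()` relies on list slicing, the row at `end_index` itself is not printed. A minimal sketch of that behaviour, using a made-up five-row list (`rows` is hypothetical and not part of the datasets):

```python
# Hypothetical stand-in for a dataset: five one-column rows
rows = [['row 0'], ['row 1'], ['row 2'], ['row 3'], ['row 4']]

first_three = rows[0:3]   # rows at index 0, 1 and 2; index 3 is excluded
print(first_three)

# The number of rows returned equals end_index - start_index
print(len(rows[1:4]))
```

So a call with a start of 0 and an end of 3 shows exactly three rows.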

Now let us see the column names in order to have a better understanding of our data.

Below we print the header of the Google Play data followed by its first three rows:

In [3]:
print(google_header)
print('\n')
explore_dataset(google_data, 0 , 3 , True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13


Similarly for the App Store data.

In [4]:
print(apple_header)
print('\n')
explore_dataset(apple_data , 0 , 3 , True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows:  7197
Number of columns:  17


For more information about the columns and their descriptions, please refer to this [link](https://www.kaggle.com/lava18/google-play-store-apps) for the Google Play dataset and this [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) for the Apple iOS dataset.

The following columns are relevant for our analysis:
- Google Play Store: `App`, `Category`, `Reviews`, `Installs`, `Type`, `Price`, and `Genres`.
- Apple iOS App Store: `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver`, and `prime_genre`.

# Data Cleaning

In this step we clean our data by removing incorrect, duplicate, or unneeded entries.

## Checking for Values Out of Expected Range

For some of the columns we know the range of values that we expect to find for each app (row in the dataset).

We check for incorrect data by finding the maximum and minimum values of such columns and making sure they all fall within the expected range.

To do so, we implement the `min_max_col()` function, which takes the following arguments:
- dataset: the dataset which we wish to check.
- col_index: index of the column to check in the dataset.

OUTPUT: the function prints the min and max values found in the given column together with the index of the row that contains each of them.


In [5]:
def min_max_col(dataset, col_index):
    # Start from the first row, then scan every row once
    max_val = float(dataset[0][col_index])
    max_index = 0
    min_val = float(dataset[0][col_index])
    min_index = 0
    for index, row in enumerate(dataset):
        value = float(row[col_index])
        if value > max_val:
            max_val = value
            max_index = index
        elif value < min_val:
            min_val = value
            min_index = index
    print("max value: ", max_val)
    print("Index of max value: ", max_index)
    print("min value: ",min_val)
    print("Index of min value: ", min_index)
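
As an alternative sketch, the same search can be written with the built-in `min()` and `max()` functions and a `key` argument. The name `min_max_col_builtin` and the `toy` rows are hypothetical, and the sketch assumes every value in the column parses to an ordinary number (entries such as missing ratings are not handled here).

```python
def min_max_col_builtin(dataset, col_index):
    """Return (min_value, min_index, max_value, max_index) for one column."""
    values = [float(row[col_index]) for row in dataset]
    # min()/max() over the row positions, compared by the value at each position;
    # on ties they return the first position, like the loop-based version
    max_index = max(range(len(values)), key=lambda i: values[i])
    min_index = min(range(len(values)), key=lambda i: values[i])
    return values[min_index], min_index, values[max_index], max_index


# Hypothetical toy rows: [name, rating]
toy = [['a', '4.1'], ['b', '19'], ['c', '1.0'], ['d', '4.7']]
print(min_max_col_builtin(toy, 1))
```

Returning the values instead of printing them makes the result easier to reuse, for example in an automated range check.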

### Step 1:

Now we use the previously defined function to find the min and max values of the `Rating` column in the Google Play dataset.

- The `Rating` column has index 2.
- Ratings should fall in the range 1.0 to 5.0.

In [6]:
min_max_col(google_data, 2)

max value:  19.0
Index of max value:  10472
min value:  1.0
Index of min value:  625


As we can see, the rating of the app at index 10472 is above the expected range, which means that the entry is wrong and should be deleted from our data.

We delete the row with index 10472 as follows:

In [7]:
del google_data[10472]    
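
Deleting by position works once, but re-running the cell would remove whichever row has moved into index 10472. A more defensive sketch filters by value instead. The helper name `drop_out_of_range`, its parameters, and the `toy` rows are hypothetical, and keeping values that parse to NaN is an assumption made for this sketch.

```python
import math


def drop_out_of_range(dataset, col_index, low, high):
    """Return a new list without rows whose value lies outside [low, high]."""
    kept = []
    for row in dataset:
        value = float(row[col_index])
        # NaN compares as False with everything, so keep it explicitly (assumption)
        if math.isnan(value) or low <= value <= high:
            kept.append(row)
    return kept


# Hypothetical toy rows: [name, rating]
toy = [['a', '4.1'], ['b', '19'], ['c', 'NaN'], ['d', '5.0']]
cleaned = drop_out_of_range(toy, 1, 1.0, 5.0)
print([row[0] for row in cleaned])
```

Running the filter a second time leaves the data unchanged, which makes the cell safe to re-execute.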

Now let's check again to make sure no other wrong values exist.

In [8]:
min_max_col(google_data, 2)

max value:  5.0
Index of max value:  329
min value:  1.0
Index of min value:  625


As we can see, all values are now within the expected range.

Similarly, we check the `user_rating` column in the Apple iOS dataset.
- The index of the column is 8.
- The values should fall in the range 0.0 to 5.0.

In [9]:
min_max_col(apple_data, 8)

max value:  5.0
Index of max value:  21
min value:  0.0
Index of min value:  199


All values are within limits.

## Checking for Duplicate Data

Next, we count the unique and duplicate apps in the Google Play dataset.

In [10]:
google_unique = []
google_duplicate = []

for row in google_data:
    app_name = row[0]
    if app_name in google_unique:
        google_duplicate.append(app_name)
    else:
        google_unique.append(app_name)
print("Number of unique apps: ", len(google_unique))
print("Number of duplicate apps: ", len(google_duplicate))

Number of unique apps:  9659
Number of duplicate apps:  1181
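
Checking membership in a list gets slower as the list grows. As a hedged alternative, the same count can be sketched with a `set`, which offers fast lookups. The function name `count_duplicates` is hypothetical and the `toy` rows only reuse a few app names for illustration.

```python
def count_duplicates(dataset, name_index):
    """Return (unique_names, duplicate_names) using a set for fast lookups."""
    seen = set()
    duplicates = []
    for row in dataset:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)   # every repeat occurrence is recorded
        else:
            seen.add(name)
    return seen, duplicates


# Hypothetical toy rows containing only the app name
toy = [['Box'], ['Slack'], ['Box'], ['Google Ads'], ['Box']]
unique_names, duplicate_names = count_duplicates(toy, 0)
print(len(unique_names), len(duplicate_names))
```

As in the loop above, an app that appears three times contributes two entries to the duplicates list.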


We can see that some apps have multiple entries in our data set.

Now we investigate the duplicate data as follows:

In [11]:
print(google_duplicate[0:20])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']


Next we print a sample of the duplicate apps to have a better understanding of the data.

In [12]:
for row in google_data:
    if row[0] == 'Google Ads':
        print(row)

['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29313', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']
['Google Ads', 'BUSINESS', '4.3', '29331', '20M', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 30, 2018', '1.12.0', '4.0.3 and up']


As we can see, in some cases the different entries for the same app have different `Reviews` values.
In this case it makes sense to keep the entry with the highest number of reviews, because a higher review count suggests more reliable information.

To do so, we first need a dictionary that contains the maximum number of reviews for each app in the dataset.



In [13]:
max_reviews_google = {}
for row in google_data:
    name = row[0]
    review = float(row[3])
    if (name in max_reviews_google) and (review > max_reviews_google[name]):
        max_reviews_google[name] = review
    elif name not in max_reviews_google:
        max_reviews_google[name] = review

Now that we have a dictionary mapping each app to its maximum number of reviews, we can clean our dataset by keeping only one entry per app: the one with the highest number of reviews.

In [14]:
google_data_clean = []
already_added = []
for row in google_data:
    app_name = row[0]
    app_reviews = float(row[3])
    if (app_reviews == max_reviews_google[app_name]) and (app_name not in already_added):
        google_data_clean.append(row)
        already_added.append(app_name)
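
The two steps above (building the dictionary, then filtering) can also be sketched as a single pass that stores the best row per app directly. The name `keep_max_reviews` is hypothetical, and the `toy` rows are trimmed to two columns (name and reviews) for illustration.

```python
def keep_max_reviews(dataset, name_index, reviews_index):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when several share the maximum
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows trimmed to [name, reviews]
toy = [
    ['Google Ads', '29313'],
    ['Sketch - Draw & Paint', '215644'],
    ['Google Ads', '29313'],
    ['Google Ads', '29331'],
]
print(keep_max_reviews(toy, 0, 1))
```

One difference to keep in mind: the result is ordered by the first appearance of each app name, whereas the two-pass version is ordered by the position of the rows that are kept.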

Now let's explore the clean data set.

In [15]:
explore_dataset(google_data_clean , 0 , 5 , True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows:  9659
Number of columns:  13


Similarly, let's check for duplicate apps in the Apple iOS dataset.

In [16]:
apple_unique = []
apple_duplicate = []

for row in apple_data:
    name = row[2]
    if name in apple_unique:
        apple_duplicate.append(name)
    else:
        apple_unique.append(name)
print("Number of unique apps: ", len(apple_unique))
print("Number of duplicate apps: ", len(apple_duplicate))

Number of unique apps:  7195
Number of duplicate apps:  2


As we can see, we only have two duplicate apps.

In [17]:
print(apple_duplicate)

['VR Roller Coaster', 'Mannequin Challenge']


Now let's print the duplicate apps so we can decide which ones to keep.

In [18]:
for row in apple_data:
    name = row[2]
    if name in apple_duplicate:
        print(row)
        print("index: ", apple_data.index(row))

['4000', '952877179', 'VR Roller Coaster', '169523200', 'USD', '0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
index:  3319
['7579', '1089824278', 'VR Roller Coaster', '240964608', 'USD', '0', '67', '44', '3.5', '4', '0.81', '4+', 'Games', '38', '0', '1', '1']
index:  5603
['10751', '1173990889', 'Mannequin Challenge', '109705216', 'USD', '0', '668', '87', '3', '3', '1.4', '9+', 'Games', '37', '4', '1', '1']
index:  7092
['10885', '1178454060', 'Mannequin Challenge', '59572224', 'USD', '0', '105', '58', '4', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
index:  7128


Similarly, we keep only the entries with the highest `rating_count_tot`.
We can manually delete the unwanted entries as follows:

In [19]:
apple_data_clean = []
for index, row in enumerate(apple_data):
    # Skip the two duplicate entries with the lower rating counts
    if index not in (5603, 7128):
        apple_data_clean.append(row)


In [20]:
explore_dataset(apple_data_clean , 0 , 5 , True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


Number of rows:  7195
Number of columns:  17


## Removing Non-English Apps

In order to remove non-English apps we need to check whether the app names are in English.

Characters commonly used in English text have ASCII values in the range 0 to 127.
Because some English apps have special characters in their names that fall outside that range, we allow up to three non-ASCII characters before we classify an app as non-English, in order to minimize data loss.

We can get the numeric code of a character (its Unicode code point, which equals the ASCII value for ASCII characters) using the `ord()` built-in function.

Next, we define the `is_eng()` function, which takes the following argument:
- app_name: the string to be evaluated

Output: returns boolean (True or False).

In [21]:
def is_eng(app_name):
    non_eng = 0
    for c in app_name:
        if ord(c) > 127:
            non_eng += 1        
    if non_eng > 3:
        return False
    return True
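
For illustration, here is a compact sketch of the same rule together with a few sample calls. The name `is_eng_compact` and the `max_non_ascii` parameter are hypothetical, and the last test string is a made-up example rather than an app from the datasets.

```python
def is_eng_compact(app_name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for c in app_name if ord(c) > 127)
    return non_ascii <= max_non_ascii


print(is_eng_compact('Evernote - stay organized'))  # no non-ASCII characters
print(is_eng_compact('Xperia Link™'))               # one non-ASCII character
print(is_eng_compact('教えて!goo'))                  # exactly three non-ASCII characters
print(is_eng_compact('日本語のアプリ'))               # made-up name, seven non-ASCII characters
```

The first three calls return `True` and the last returns `False`. Names with exactly three non-ASCII characters still pass the check, so the rule is a heuristic and a few non-English titles can remain in the filtered data.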

Now we can go over our clean data and keep only the English apps as follows:

In [22]:
google_clean_eng = []
for row in google_data_clean:
    name = row[0]
    if is_eng(name):
        google_clean_eng.append(row)

apple_clean_eng = []
for row in apple_data_clean:
    name = row[2]
    if is_eng(name):
        apple_clean_eng.append(row)

Let's explore our clean data of English apps.

In [23]:
explore_dataset(google_clean_eng , 0 , 3 , True)
explore_dataset(apple_clean_eng , 0 , 3 , True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows:  9614
Number of columns:  13
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188

## Removing Non-Free Apps

Next we keep only the free apps in each dataset: for Google Play we check whether the `Type` column equals `'Free'`, and for the App Store we check whether the price equals 0.0.

In [24]:
google_data_final = []
for row in google_clean_eng:
    app_type = row[6]
    if app_type == 'Free':
        google_data_final.append(row)
        
apple_data_final = []
for row in apple_clean_eng:
    price = float(row[5])
    if price == 0.0:
        apple_data_final.append(row)
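
The Google Play filter relies on the `Type` column, while the App Store filter converts `price` to a number. If one wanted to filter Google Play by price as well, the text would need to be converted first. The sketch below assumes that paid prices may carry a leading `$` (an assumption, since only the value `'0'` is visible in the rows printed in this notebook); `parse_price` and `is_free` are hypothetical helpers.

```python
def parse_price(price_text):
    """Convert a price string such as '0', '3.99' or '$4.99' to a float."""
    # Assumption: an optional leading '$' is the only non-numeric decoration
    return float(price_text.strip().lstrip('$'))


def is_free(price_text):
    """Return True when the parsed price is exactly zero."""
    return parse_price(price_text) == 0.0


print(is_free('0'), is_free('3.99'), is_free('$4.99'))
```

Comparing the result of such a check with the `Type` column would be one way to cross-check the two fields.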

In [25]:
explore_dataset(google_data_final , 0 , 5 , True)
explore_dataset(apple_data_final, 0 , 5 , True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows:  8863
Number of columns:  13
['2', '281796108', 'Evernote - stay organized'

## Most Popular Genres

As mentioned earlier, our goal is to determine which apps attract the most users on both the Apple iOS App Store and the Google Play Store, so first we need to find which genres are most common.

From the data we have, the following columns can be used to describe the app's genre on Google Play Store:

- `Category` : with index 1.
- `Genres` : with index 9.

For the Apple iOS App Store, the following column describes the app's genre:

- `prime_genre` : with index 12.

In order to be able to see the most common or popular genres, we need to see how many times each value occurs in the previously mentioned columns; hence, we need to calculate frequency tables for those columns.

### First Step

Next we define the function `frequency_table()`, which calculates a frequency table and takes the following arguments:

- `dataset` : the dataset we wish to use to calculate the frequency table.
- `column_index`: index of the column we are interested in.

OUTPUT: Returns the frequency table as a dictionary of percentages.

In [26]:
def frequency_table(dataset, column_index):
    table = {}
    total = len(dataset)
    for row in dataset:
        value = row[column_index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1 
            
    table_perc = {}
    for element in table:
        table_perc[element] = ( table[element] / total ) * 100 
    return table_perc
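
As a hedged alternative, the counting step can be sketched with `collections.Counter` from the standard library. The name `frequency_table_counter` and the `toy` rows are hypothetical.

```python
from collections import Counter


def frequency_table_counter(dataset, column_index):
    """Return {value: percentage of rows} for one column."""
    counts = Counter(row[column_index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


# Hypothetical toy rows: [name, genre]
toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
print(frequency_table_counter(toy, 1))
```

The percentages add up to 100 because every row contributes to exactly one value.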


The following function, `explore_table()`, takes a frequency table as an argument and prints its values in descending order.

In [27]:
def explore_table(table):
    sorted_table = sorted(table.items(), key=lambda x: x[1], reverse=True)
          
    for el in sorted_table:
        print(el[0],": ",el[1], " %")
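
The long decimal tails in the printed percentages can be hard to scan. A small sketch of a variant that rounds the values when formatting; `format_table` and its `decimals` parameter are hypothetical names.

```python
def format_table(table, decimals=2):
    """Return one 'name: x.xx %' string per entry, largest percentage first."""
    ordered = sorted(table.items(), key=lambda item: item[1], reverse=True)
    return [f"{name}: {value:.{decimals}f} %" for name, value in ordered]


for line in format_table({'Weather': 25.0, 'Games': 75.0}):
    print(line)
```

Rounding only at display time keeps the underlying table values exact.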
              

### Frequency Table for `Category` in Google Play Store:

In [28]:
explore_table(frequency_table(google_data_final, 1))

FAMILY :  18.898792733837304  %
GAME :  9.725826469592688  %
TOOLS :  8.462146000225657  %
BUSINESS :  4.592124562789123  %
LIFESTYLE :  3.9038700214374367  %
PRODUCTIVITY :  3.8925871601038025  %
FINANCE :  3.7007785174320205  %
MEDICAL :  3.5315355974275078  %
SPORTS :  3.396141261423897  %
PERSONALIZATION :  3.317161232088458  %
COMMUNICATION :  3.2381812027530184  %
HEALTH_AND_FITNESS :  3.0802211440821394  %
PHOTOGRAPHY :  2.944826808078529  %
NEWS_AND_MAGAZINES :  2.798149610741284  %
SOCIAL :  2.6627552747376737  %
TRAVEL_AND_LOCAL :  2.335552296062281  %
SHOPPING :  2.245289405393208  %
BOOKS_AND_REFERENCE :  2.1437436533904997  %
DATING :  1.8616721200496444  %
VIDEO_PLAYERS :  1.7939749520478394  %
MAPS_AND_NAVIGATION :  1.399074805370642  %
FOOD_AND_DRINK :  1.241114746699763  %
EDUCATION :  1.1621347173643235  %
ENTERTAINMENT :  0.9590432133589079  %
LIBRARIES_AND_DEMO :  0.9364774906916393  %
AUTO_AND_VEHICLES :  0.9251946293580051  %
HOUSE_AND_HOME :  0.8236488773552973  

### Frequency Table for `Genres`  in Google Play Store:

In [29]:
explore_table(frequency_table(google_data_final, 9))

Tools :  8.450863138892023  %
Entertainment :  6.070179397495204  %
Education :  5.348076272142616  %
Business :  4.592124562789123  %
Lifestyle :  3.8925871601038025  %
Productivity :  3.8925871601038025  %
Finance :  3.7007785174320205  %
Medical :  3.5315355974275078  %
Sports :  3.463838429425702  %
Personalization :  3.317161232088458  %
Communication :  3.2381812027530184  %
Action :  3.102786866749408  %
Health & Fitness :  3.0802211440821394  %
Photography :  2.944826808078529  %
News & Magazines :  2.798149610741284  %
Social :  2.6627552747376737  %
Travel & Local :  2.324269434728647  %
Shopping :  2.245289405393208  %
Books & Reference :  2.1437436533904997  %
Simulation :  2.042197901387792  %
Dating :  1.8616721200496444  %
Arcade :  1.8503892587160102  %
Video Players & Editors :  1.771409229380571  %
Casual :  1.7601263680469368  %
Maps & Navigation :  1.399074805370642  %
Food & Drink :  1.241114746699763  %
Puzzle :  1.128286133363421  %
Racing :  0.9928917973598104  

Since we don't have a clear description of the difference between the `Category` and `Genres` columns, we are going to take only the `Category` data into consideration, because it gives us the needed information in a more clearly defined way and is similar to the Apple iOS App Store data (hence easier to compare).

### Frequency Table for `prime_genre` in Apple iOS Store:

In [30]:
explore_table(frequency_table(apple_data_final, 12))

Games :  58.13664596273293  %
Entertainment :  7.888198757763975  %
Photo & Video :  4.968944099378882  %
Education :  3.6645962732919255  %
Social Networking :  3.291925465838509  %
Shopping :  2.608695652173913  %
Utilities :  2.515527950310559  %
Sports :  2.142857142857143  %
Music :  2.049689440993789  %
Health & Fitness :  2.018633540372671  %
Productivity :  1.7391304347826086  %
Lifestyle :  1.5838509316770186  %
News :  1.3354037267080745  %
Travel :  1.2422360248447204  %
Finance :  1.1180124223602486  %
Weather :  0.8695652173913043  %
Food & Drink :  0.8074534161490683  %
Reference :  0.5590062111801243  %
Business :  0.5279503105590062  %
Book :  0.43478260869565216  %
Navigation :  0.18633540372670807  %
Medical :  0.18633540372670807  %
Catalogs :  0.12422360248447205  %


## Most Used Genres

Now that we know the most popular app genres that exist on both stores, we need to know the average number of user installs for each genre in order to have a more detailed idea of the app profiles that attract more users.

### Most Installed Categories on Google Play Store 

For this we take the `Installs` column (index 5) into consideration. Note that the values are not precise (they are open-ended, such as 10,000+), so for our analysis we use only the base value as an approximation that gives us a notion of popularity.

For example, 10,000+ is treated as 10,000.
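
The approximation described above can be sketched as a small helper, together with a single-pass average per category. The names `parse_installs` and `average_installs_by_category` and the `toy` rows are hypothetical; the column positions follow the Google Play layout (category at index 1, installs at index 5).

```python
from collections import defaultdict


def parse_installs(installs_text):
    """Turn an open-ended value such as '10,000+' into the number 10000.0."""
    return float(installs_text.replace('+', '').replace(',', ''))


def average_installs_by_category(dataset, category_index=1, installs_index=5):
    """Return {category: average installs}, reading the dataset only once."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        category = row[category_index]
        totals[category] += parse_installs(row[installs_index])
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


# Hypothetical toy rows; only the category and installs columns are filled in
toy = [
    ['app 1', 'ART_AND_DESIGN', '', '', '', '10,000+'],
    ['app 2', 'ART_AND_DESIGN', '', '', '', '500,000+'],
    ['app 3', 'BUSINESS', '', '', '', '5,000,000+'],
]
print(average_installs_by_category(toy))
```

Accumulating totals and counts in dictionaries avoids scanning the whole dataset once per category.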


In [31]:
categories = frequency_table(google_data_final, 1)

for category in categories:
    total_value = 0
    number_of_values = 0
    for app in google_data_final:
        current_cat = app[1]
        if current_cat == category:
            installs = app[5]
            installs = installs[:-1]
            installs = installs.replace(',','')
            total_value += float(installs)
            number_of_values += 1 
    avg_installs = total_value / number_of_values
    print(category,' :',avg_installs)

ART_AND_DESIGN  : 1986335.0877192982
AUTO_AND_VEHICLES  : 647317.8170731707
BEAUTY  : 513151.88679245283
BOOKS_AND_REFERENCE  : 8767811.894736841
BUSINESS  : 1712290.1474201474
COMICS  : 817657.2727272727
COMMUNICATION  : 38456119.167247385
DATING  : 854028.8303030303
EDUCATION  : 1833495.145631068
ENTERTAINMENT  : 11640705.88235294
EVENTS  : 253542.22222222222
FINANCE  : 1387692.475609756
FOOD_AND_DRINK  : 1924897.7363636363
HEALTH_AND_FITNESS  : 4188821.9853479853
HOUSE_AND_HOME  : 1331540.5616438356
LIBRARIES_AND_DEMO  : 638503.734939759
LIFESTYLE  : 1437816.2687861272
GAME  : 15588015.603248259
FAMILY  : 3697848.1731343283
MEDICAL  : 120550.61980830671
SOCIAL  : 23253652.127118643
SHOPPING  : 7036877.311557789
PHOTOGRAPHY  : 17840110.40229885
SPORTS  : 3638640.1428571427
TRAVEL_AND_LOCAL  : 13984077.710144928
TOOLS  : 10801391.298666667
PERSONALIZATION  : 5201482.6122448975
PRODUCTIVITY  : 16787331.344927534
PARENTING  : 542603.6206896552
WEATHER  : 5074486.197183099
VIDEO_PLAYERS 

As we can see, the apps categorized as `COMMUNICATION` have the highest average number of installs. Let's take a closer look at these apps.

In [32]:
for app in google_data_final:
    if app[1] == "COMMUNICATION":
        print(app[0] , ':', app[5])

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

We can see that some apps have significantly higher values than others. Next we print only the apps with installs equal to 1,000,000,000+ or 500,000,000+.

In [33]:
for app in google_data_final:
    if app[1] == "COMMUNICATION" and (app[5]== "1,000,000,000+" or app[5]== "500,000,000+"):
        print(app[0] , ':', app[5])

WhatsApp Messenger : 1,000,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+


As we can see, this category is dominated by a few apps such as WhatsApp and Messenger, which are the reason for the high average number of installs.

In this case we need to find the next best category since this one is already saturated.
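
One hedged way to quantify how much the largest apps drive an average is to compute it twice, with and without apps above a chosen install count. The function name `average_installs`, the `max_installs` parameter, and the 100,000,000 cut-off are hypothetical; the three sample pairs are copied from the `COMMUNICATION` listing.

```python
def average_installs(apps, max_installs=None):
    """Average the install counts, optionally ignoring apps above max_installs."""
    values = []
    for name, installs_text in apps:
        installs = float(installs_text.replace('+', '').replace(',', ''))
        if max_installs is None or installs <= max_installs:
            values.append(installs)
    # Note: raises ZeroDivisionError if the cut-off excludes every app
    return sum(values) / len(values)


# Three (name, installs) pairs copied from the COMMUNICATION listing
sample = [
    ('WhatsApp Messenger', '1,000,000,000+'),
    ('Messenger for SMS', '10,000,000+'),
    ('My Tele2', '5,000,000+'),
]
print(average_installs(sample))
print(average_installs(sample, max_installs=100_000_000))
```

In this small sample the average drops from roughly 338 million to 7.5 million once the single largest app is left out.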

Next we investigate the apps in the `VIDEO_PLAYERS` category.

In [34]:
for app in google_data_final:
    if app[1] == "VIDEO_PLAYERS":
        print(app[0] , ':', app[5])

YouTube : 1,000,000,000+
All Video Downloader 2018 : 1,000,000+
Video Downloader : 10,000,000+
HD Video Player : 1,000,000+
Iqiyi (for tablet) : 1,000,000+
Video Player All Format : 10,000,000+
Motorola Gallery : 100,000,000+
Free TV series : 100,000+
Video Player All Format for Android : 500,000+
VLC for Android : 100,000,000+
Code : 10,000,000+
Vote for : 50,000,000+
XX HD Video downloader-Free Video Downloader : 1,000,000+
OBJECTIVE : 1,000,000+
Music - Mp3 Player : 10,000,000+
HD Movie Video Player : 1,000,000+
YouCut - Video Editor & Video Maker, No Watermark : 5,000,000+
Video Editor,Crop Video,Movie Video,Music,Effects : 1,000,000+
YouTube Studio : 10,000,000+
video player for android : 10,000,000+
Vigo Video : 50,000,000+
Google Play Movies & TV : 1,000,000,000+
HTC Service － DLNA : 10,000,000+
VPlayer : 1,000,000+
MiniMovie - Free Video and Slideshow Editor : 50,000,000+
Samsung Video Library : 50,000,000+
OnePlus Gallery : 1,000,000+
LIKE – Magic Video Maker & Community : 50,

Similarly, we take a look at apps with the highest number of installs in this category.

In [35]:
for app in google_data_final:
    if app[1] == "VIDEO_PLAYERS" and (app[5]== "1,000,000,000+" or app[5]== "500,000,000+"):
        print(app[0] , ':', app[5])

YouTube : 1,000,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+


Again in this category, we can see that a couple of applications are the reason for the high average number of installs.

Next, we look at the `BOOKS_AND_REFERENCE` category.

In [36]:
for app in google_data_final:
    if app[1] == "BOOKS_AND_REFERENCE" and (app[5]== "1,000,000,000+" or app[5]== "500,000,000+"):
        print(app[0] , ':', app[5])

Google Play Books : 1,000,000,000+


For the `BOOKS_AND_REFERENCE` category we can see that the high average number of installs is not significantly skewed by a few apps, unlike in the `COMMUNICATION` and `VIDEO_PLAYERS` categories.

This shows a good potential in this category.

## Most Reviewed Apps on Apple iOS Store

For this part, we can use the `rating_count_tot` column to get an idea of which apps are used more because, unlike the Google Play Store data, this dataset has no information about the number of installs.

Next we calculate the average total number of ratings for the apps in each genre.

In [37]:
genres = frequency_table(apple_data_final, 12)

for genre in genres:
    total_ratings = 0
    number_of_ratings = 0
    for row in apple_data_final:
        if (row[12] == genre):
            total_ratings += float(row[6])
            number_of_ratings += 1
    avg_tot_ratings = total_ratings / number_of_ratings
    print(genre,": ",avg_tot_ratings)
    

Productivity :  21028.410714285714
Weather :  52279.892857142855
Shopping :  26919.690476190477
Reference :  74942.11111111111
Finance :  31467.944444444445
Music :  57326.530303030304
Utilities :  18684.456790123455
Travel :  28243.8
Social Networking :  71548.34905660378
Sports :  23008.898550724636
Health & Fitness :  23298.015384615384
Games :  22812.92467948718
Food & Drink :  33333.92307692308
News :  21248.023255813954
Book :  39758.5
Photo & Video :  28441.54375
Entertainment :  14029.830708661417
Business :  7491.117647058823
Lifestyle :  16485.764705882353
Education :  7003.983050847458
Navigation :  86090.33333333333
Medical :  612.0
Catalogs :  4004.0


Let's check the apps in the `Reference` genre since it has a high average number of total ratings.

In [38]:
for app in apple_data_final:
    if app[12] == "Reference":
        print(app[2],': ', app[6])
     

Bible :  985920
Dictionary.com Dictionary & Thesaurus :  200047
Dictionary.com Dictionary & Thesaurus for iPad :  54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran :  18418
Merriam-Webster Dictionary :  16849
Google Translate :  26786
Night Sky :  12122
WWDC :  762
Jishokun-Japanese English Dictionary & Translator :  0
教えて!goo :  0
VPN Express :  14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition :  17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools :  4693
Guides for Pokémon GO - Pokemon GO News and Cheats :  826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free :  718
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) :  8535
GUNS MODS for Minecraft PC Edition - Mods Tools :  1497
Real Bike Traffic Rider Virtual Reality Glasses :  8


Now, checking `Book` genre.

In [39]:
for app in apple_data_final:
    if app[12] == "Book":
        print(app[2],': ', app[6])
     

Kindle – Read eBooks, Magazines & Textbooks :  252076
OverDrive – Library eBooks and Audiobooks :  65450
Audible – audio books, original series & podcasts :  105274
BookShout: Read eBooks & Track Your Reading Goals :  879
ikouhoushi :  0
Dr. Seuss Treasury — 50 best kids books :  451
Weirdwood Manor :  197
Green Riding Hood :  392
HOOKED - Chat Stories :  47829
Color Therapy Adult Coloring Book for Adults :  84062
MangaTiara - love comic reader :  0
MangaZERO - comic reader :  9
謎解き2016 :  0
謎解き :  0


From the data, we can see a potential similar to what we found in the Google Play Store.

# Conclusion

The aim of this project was to analyze the data for both the Google Play Store and the Apple iOS App Store in order to determine which kinds of free English apps attract the highest number of users and still show good potential on both stores.

We can conclude that apps in the Books genre or category have a high potential to attract users: our datasets indicate a high average number of users (installs or total ratings) for them, and that average is not skewed by a few specific apps.