# Investment Opportunities -- Analyzing Profitable Mobile App Profiles 
### Introduction
This project showcases how to clean and analyze mobile app data using functions, conditional statements, for-loops, and dictionaries.

The scope of the project is free, English-language apps currently in the Apple Store and Google Play markets.

**Goal of this project:** Recognize what makes a mobile app profitable.
With free applications, revenue streams are created when users interact with ads or purchase in-game items. 
Ergo, more users means more revenue.

Information like genre, user rating, installs, and reviews is used to find which apps are most popular. 

Initial findings show that 'Games' is the most common genre in the Apple Store, accounting for about 58% of apps. Google Play has a more even distribution across its genres, with 'Tools' the most common at about 8%.

### Data Sources

Apple Store data source direct link: https://dq-content.s3.amazonaws.com/350/AppleStore.csv

Google Play Store data source direct link: https://dq-content.s3.amazonaws.com/350/googleplaystore.csv

### Reading in data and creating a list of lists
The `reader` function is imported from the `csv` module and used to read the CSV files for the Apple Store and Google Play datasets.

In [1]:
# import the reader function from the csv module and read in the files

from csv import reader

applefile = open('AppleStore.csv', encoding='utf8')
googlefile = open('googleplaystore.csv', encoding='utf8')

read_file1 = reader(applefile)
read_file2 = reader(googlefile)

# Create list of lists for each dataset

appdata = list(read_file1)
googdata = list(read_file2)
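
The cell above leaves both file handles open. As a minimal sketch, the same loading step can be wrapped in a `with` block so that each file is closed automatically. The helper name `load_csv` and the `utf8` default are assumptions for illustration, not part of the original notebook.

```python
from csv import reader

def load_csv(path, encoding="utf8"):
    """Return the rows of a CSV file as a list of lists (header included)."""
    # newline="" is the setting the csv module documentation recommends for reader()
    with open(path, encoding=encoding, newline="") as f:
        return list(reader(f))
```

With this helper, `appdata = load_csv('AppleStore.csv')` would produce the same list of lists as the cell above.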


### Exploring Datasets to Determine Analysis Procedure
A function `explore_data` was created to quickly gain insight into a dataset, showing a chosen slice of data along with number of rows and columns.

In [2]:
# Function that takes a dataset, the start and end of a slice, and a boolean flag as arguments.
# It prints the requested slice and, if the flag is True, the number of rows and columns.

def explore_data(dataset, start, end, rows_and_columns=False):
    
    dataset_slice = dataset[start:end]
    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset)) # Finds the length of the entire dataset ie rows
        print('Number of columns:', len(dataset[0])) # Finds length of row in dataset ie columns

Let's use our previously written `explore_data` function to print a few lines of each dataset.

In [3]:
explore_data(appdata, 0,2,True)
print('\n')
explore_data(googdata, 0,2,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


## Data Cleaning 
 
Begin by searching for errors that need deleting or correction.

This can come in the form of duplicate rows, rows with blanks, etc.

### Deleting or Correcting Errors

One method would be to iterate over the dataset, find the blanks, and return the index of each affected row so that it can be corrected or deleted. This will be done at a later time.

*Fortuitously*, a community discussion on Kaggle found an error in the Google Play dataset: row 10473 (header included) contains a blank where a string should be.

For now, let us delete this row and assume it is the only such error.

In [4]:
del googdata[10473]
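
The iteration approach mentioned earlier could look roughly like the sketch below. It flags rows that have a different number of columns than the header or that contain an empty cell. The function name `find_bad_rows` and the toy `sample` data are hypothetical, and the check is an assumption about what a "blank" looks like in these files.

```python
def find_bad_rows(dataset):
    """Return indexes of rows whose length differs from the header or that contain an empty cell."""
    expected = len(dataset[0])
    bad = []
    # start=1 keeps the indexes aligned with the full dataset, header included
    for i, row in enumerate(dataset[1:], start=1):
        if len(row) != expected or any(cell.strip() == "" for cell in row):
            bad.append(i)
    return bad

sample = [
    ["App", "Category", "Rating"],
    ["Foo", "TOOLS", "4.1"],
    ["Bar", "1.9"],        # a column is missing, so later values shift left
    ["Baz", "", "3.0"],    # an empty cell
]
print(find_bad_rows(sample))  # → [2, 3]
```

Because the indexes include the header, they can be passed straight to `del`. When deleting several rows, working from the highest index down avoids shifting the rows that remain.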

### Finding and Removing Duplicates

#### Finding Duplicates:
Let's create a function `find_duplicates` that takes a dataset and the desired column index as arguments and prints the total number of entries, the number of duplicates, and the number of unique elements.

In [5]:
def find_duplicates(dataset, index): # Assumes dataset with header
    
    duplicate_apps = [] # Empty lists are created to isolate both duplicated and unique apps
    unique_apps = []
    
    for row in dataset[1:]: # Iterate over the dataset without the header and read the value at the desired index
        name = row[index]
        if name in unique_apps: # If the value has already been seen, it is a duplicate
            duplicate_apps.append(name)
        else:
            unique_apps.append(name) # If the value has not been seen yet, it is unique
            
    total = len(duplicate_apps) + len(unique_apps)

    print('Total entries: ' , total,'\n',
          'No. of dup. apps: ', len(duplicate_apps),'\n',
          'No. of unique apps: ',len(unique_apps),'\n')
    
    if len(duplicate_apps) >= 1:
        print('These are some of the duplicated apps: ',duplicate_apps[:3], '\n')

Check the datasets for duplicates. The number of duplicates plus the number of unique elements should equal the number of data rows (header excluded) in each dataset. ***Note: The `find_duplicates` function assumes a header row.***

In [6]:
find_duplicates(appdata, 0)
find_duplicates(googdata,0)

Total entries:  7197 
 No. of dup. apps:  0 
 No. of unique apps:  7197 

Total entries:  10840 
 No. of dup. apps:  1181 
 No. of unique apps:  9659 

These are some of the duplicated apps:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business'] 



The header row is not included when finding duplicates, so each total is one less than the number of rows in the corresponding dataset.

The Apple Store dataset returns no duplicates, whereas the Google Play dataset returns 1,181 duplicated rows.
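
Checking `name in unique_apps` against a list rescans the list for every row. As a hedged alternative, the same three counts can be obtained with `collections.Counter`. The function name `count_duplicates` and the toy `sample` data are assumptions for illustration.

```python
from collections import Counter

def count_duplicates(dataset, index):
    """Return (total, n_duplicates, n_unique) for one column of a dataset with a header row."""
    counts = Counter(row[index] for row in dataset[1:])
    total = sum(counts.values())
    n_unique = len(counts)
    # Every occurrence beyond the first one of a value counts as a duplicate
    return total, total - n_unique, n_unique

sample = [["App"], ["Box"], ["Box"], ["Gmail"], ["Box"]]
print(count_duplicates(sample, 0))  # → (4, 2, 2)
```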

#### Removing Duplicates:
While the duplicate rows can be removed randomly, it is best to understand why there are duplicates and to remove them based on certain criteria.

For example, the Google Play dataset has duplicates for the application Instagram. Because the data was saved at different times, the number of user reviews differs between copies.

Keep the latest data by setting a rule that only considers the row with the largest number of reviews, i.e. the highest number of reviews implies the most recently updated entry.

In [7]:
# If the app is in the dictionary and its stored number of reviews is less than the current row's, update the value

# If the app is not in the dictionary, store the current row's number of reviews

reviews_max = {}

for row in googdata[1:]:
    
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews


print('Total apps: ', len(googdata[1:]),'\n',
      'Unique: ', len(reviews_max), '\n',
      'Copies: ', len(googdata[1:]) - len(reviews_max))

Total apps:  10840 
 Unique:  9659 
 Copies:  1181


Remove the duplicates by initializing two lists `android_clean` and `already_added` in which to store the new clean data set and list of copied names, respectively.

Iterate over the google dataset without the header row.

In [8]:
android_clean = [] # This will store the updated, unique app dataset
already_added = [] # This list will only store the names

for row in googdata[1:]:
    
    name = row[0]
    n_reviews = float(row[3])
    
    # The second condition handles exact duplicates, where several rows share the same maximum number of reviews
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
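
The two passes above (build `reviews_max`, then filter) can also be folded into one. The sketch below keeps, for each app name, the row with the most reviews. The function name `keep_most_reviewed`, its default column indexes, and the toy `sample` data are assumptions for illustration.

```python
def keep_most_reviewed(dataset, name_idx=0, reviews_idx=3):
    """Return one row per app name, keeping the row with the most reviews (header excluded)."""
    best = {}
    for row in dataset[1:]:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the first row seen when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

sample = [
    ["App", "Category", "Rating", "Reviews"],
    ["Instagram", "SOCIAL", "4.5", "100"],
    ["Box", "BUSINESS", "4.2", "50"],
    ["Instagram", "SOCIAL", "4.5", "120"],
    ["Box", "BUSINESS", "4.2", "50"],
]
for row in keep_most_reviewed(sample):
    print(row)
```

One difference to keep in mind: the result is ordered by the first appearance of each name, whereas the loop above keeps each app at the position of its most-reviewed row.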

Explore the data and check that the number of rows is correct.

In [9]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


### Filtering Data

**English Apps**

We previously stated that our scope is free, English-language apps.

Let's create a function that removes non-English apps from the datasets.

Characters commonly used in English text fall in the ASCII range (0-127), so the code point returned by `ord()` will be used to decide which app names to filter out.

In [10]:
def is_eng(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

In [11]:
print(is_eng('Instagram'))
print(is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_eng('Docs To Go™ Free Office Suite'))
print(is_eng('Instachat 😜'))

True
False
False
False


Emojis and other symbols such as ™ have code points greater than 127, so the function wrongly flags these English app names.

To limit data loss, only names with more than three characters above 127 will be treated as non-English, which should be sufficient for this project.

In [12]:
def is_eng(string):
    
    non_ascii = 0
    
    for character in string:
        
        if ord(character) > 127: # ord() returns the Unicode code point of the character
            non_ascii += 1 # increments by 1 for every character with a code point above 127
            
    if non_ascii > 3:
        return False
    else:
        return True

In [13]:
print(is_eng('Instagram'))
print(is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_eng('Docs To Go™ Free Office Suite'))
print(is_eng('Instachat 😜'))

True
False
True
True
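
To see what the threshold is counting, the code points can be printed directly. The helper `count_non_ascii` below is a hypothetical name used only for this sketch; it returns the tally that `is_eng` compares against 3.

```python
def count_non_ascii(string):
    """Count the characters whose Unicode code point falls outside the ASCII range (0-127)."""
    return sum(ord(character) > 127 for character in string)

print(ord('a'))   # → 97
print(ord('™'))   # → 8482
print(ord('😜'))  # → 128540

for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    print(name, ':', count_non_ascii(name))
```

Each of the two English names with a symbol contains a single character above 127, which is why both pass once up to three such characters are allowed.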


Using this updated function, non-English apps will be filtered out of both datasets.

Apply the filter to the Apple Store dataset and explore the result.

In [14]:
ios_eng = []

for row in appdata[1:]:
    
    name = row[1]
    
    if is_eng(name):
        
        ios_eng.append(row)

explore_data(ios_eng, 0, 2, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 6183
Number of columns: 16


Now do the same for the google dataset.

In [15]:
android_eng = []

for row in android_clean:
    name = row[0]
    if is_eng(name):
        android_eng.append(row)

explore_data(android_eng, 0, 2, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9614
Number of columns: 13


**Free Apps**

The scope of this project is *free*, English-language apps.

For now, the final cleaning of the data will be to extract all free apps. 

In [16]:
free_eng_ios = []

free_eng_android = []


for row in ios_eng:
    
    price = row[4]
    
    if price == '0.0': # The apple dataset has the string of 0.0 in place of free apps
        free_eng_ios.append(row)

for row in android_eng:
    
    price = row[7]
    
    if price == '0': # Unlike the apple dataset, google dataset uses the string of 0 for free apps
        free_eng_android.append(row)
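
Comparing against the exact strings `'0.0'` and `'0'` works for these two files but depends on how each store formats a zero price. A slightly more tolerant sketch converts the price to a number first. The helper name `is_free` is hypothetical, and the handling of a `$` prefix is an assumption about how paid prices are written.

```python
def is_free(price):
    """Return True when a price string such as '0', '0.0' or '$0.00' represents zero."""
    cleaned = price.replace('$', '').strip()
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable values (for example an empty string) are treated as not free
        return False

print(is_free('0'))      # → True
print(is_free('0.0'))    # → True
print(is_free('$4.99'))  # → False
```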

In [17]:
explore_data(free_eng_android, 0, 3, True)
explore_data(free_eng_ios, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'G

## Data Analysis

Our goal is to determine the kinds of apps that are likely to attract more users, since the more users who download and interact with our applications, the more revenue they generate.

Explore the header rows from the datasets and check which columns can help in our analysis.

In [18]:
explore_data(googdata, 0, 1, True)
print('\n')
explore_data(appdata, 0, 1, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows: 10841
Number of columns: 13


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Number of rows: 7198
Number of columns: 16


The columns below can be scrutinized in our quest to find what makes an app popular. 

Google Play: `Category` [1], `Rating` [2], `Reviews` [3], `Installs` [5], and `Content Rating` [8]

Apple Store: `rating_count_tot` [5], `user_rating` [7], `cont_rating` [10], and `prime_genre` [11]

Create a function `freq_table` that will create a frequency table for any column.

In [19]:
def freq_table(dataset, index):
    
    fq_table = {}
    total = len(dataset)
    
    for row in dataset:
        
        column = row[index]
        
        if column not in fq_table:
            fq_table[column] = 1
            
        else:
            fq_table[column] += 1
    
    for key in fq_table: # Convert the counts into percentages using the total length of the dataset
        
        fq_table[key] /= total
        fq_table[key] *= 100
    
    return fq_table
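
For comparison, the counting step can also be delegated to `collections.Counter` from the standard library. This is only a sketch of an equivalent approach; the name `freq_table_counter` and the toy `sample` data are assumptions.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for one column, like freq_table, using collections.Counter."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
print(freq_table_counter(sample, 1))  # → {'Games': 75.0, 'Music': 25.0}
```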

In [20]:
freq_table(free_eng_ios, 11)

{'Social Networking': 3.2898820608317814,
 'Photo & Video': 4.9658597144630665,
 'Games': 58.16263190564867,
 'Music': 2.0484171322160147,
 'Reference': 0.5586592178770949,
 'Health & Fitness': 2.0173805090006205,
 'Weather': 0.8690254500310366,
 'Utilities': 2.5139664804469275,
 'Travel': 1.2414649286157666,
 'Shopping': 2.60707635009311,
 'News': 1.3345747982619491,
 'Navigation': 0.186219739292365,
 'Lifestyle': 1.5828677839851024,
 'Entertainment': 7.883302296710118,
 'Food & Drink': 0.8069522036002483,
 'Sports': 2.1415270018621975,
 'Book': 0.4345127250155183,
 'Finance': 1.1173184357541899,
 'Education': 3.662321539416512,
 'Productivity': 1.7380509000620732,
 'Business': 0.5276225946617008,
 'Catalogs': 0.12414649286157665,
 'Medical': 0.186219739292365}

The frequency table contains the data we want, but not in a readable order.

Create a function `display_table` that takes the output of the `freq_table` function and prints the entries sorted in descending order.

In [21]:
def display_table(dataset, index):
    
    table = freq_table(dataset, index) # Assign the output dictionary to the variable table
    table_display = [] # Initialize an empty list

    for key in table: # Iterating over a dictionary yields its keys

        key_val_as_tuple = (table[key], key) # Build a (value, key) tuple so that sorting orders by the value
        table_display.append(key_val_as_tuple) # Append the tuple to the list

    # sorted() compares tuples by their first element, which is why the pair was reversed

    table_sorted = sorted(table_display, reverse = True)
    
    for entry in table_sorted:
        
        print(entry[1], ':', entry[0])
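
Reversing each pair is one way to sort by value. Another is to pass a `key` function to `sorted`, as in the sketch below. The name `sorted_table` and the toy `sample` dictionary are assumptions for illustration.

```python
def sorted_table(table):
    """Return (key, percentage) pairs from a frequency dictionary, largest percentage first."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

sample = {'Music': 25.0, 'Games': 60.0, 'Weather': 15.0}
for name, pct in sorted_table(sample):
    print(f"{name} : {pct:.2f}%")
```

The two approaches can order ties differently: the tuple version falls back to comparing the keys, while the `key` version keeps tied entries in their original order.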

### Most Common Apps by Genre

**Apple Store**

Using the `display_table` function, we can analyze what genres are most common in the `free_eng_ios` dataset.

The frequency table shows that *Games* are the most common genre, accounting for 58% of the aforementioned dataset. 
Coming in second are *Entertainment* type apps at nearly 8%, an overwhelming difference from *Games*.

The general impression is that games and entertainment applications dominate the `free_eng_ios` dataset, while practical genres like finance, weather, and navigation account for far fewer apps. Note that this measures how many apps exist in each genre, not how many users they have.

In [22]:
display_table(free_eng_ios, 11) # Primary genre for ios apps

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


**Google Play**

Let's now use the `display_table` function on the `free_eng_android` dataset for the `Category` and `Genres` columns.

The top category, Family, accounts for 18.9% of apps, while the second, Game, is at 9.7%. The gap between first and second is about 9 percentage points for this dataset, compared with about 50 percentage points for `free_eng_ios`.

Similarly, the most common value in the `Genres` column is Tools at 8.5%, followed by Entertainment at 6%, a gap of only about 2 percentage points.

`free_eng_android` shows a more balanced distribution, where fun **and** practical apps are both well represented.

In [23]:
display_table(free_eng_android,1) # categories for google apps
print('____________________','\n')
display_table(free_eng_android,9) # secondary genres for google apps

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

### Most Popular Apps by Genre

Calculate the average number of installs for each app genre to find what kinds are most popular.

**Apple Store**

The column `rating_count_tot` (the total number of user ratings) will be used as a proxy for installs, since the Apple Store dataset does not contain an installs column.

Let's utilize nested loops and do the following:

    Isolate the apps of each genre
    Add up the number of user ratings for the apps of that genre
    Divide the sum by the number of apps belonging to that genre (not by the total number of apps)

In [24]:
genres_ios = freq_table(free_eng_ios, 11)

for genre in genres_ios:
    
    total = 0
    len_genre = 0
    
    for app in free_eng_ios:
        
        genre_app = app[11]
        
        if genre_app == genre:
            
            no_of_ratings = float(app[5])
            total += no_of_ratings
            len_genre += 1
            
    average_no = total / len_genre    
    print(f"{genre}: {average_no:,.2f}")

Social Networking: 71,548.35
Photo & Video: 28,441.54
Games: 22,788.67
Music: 57,326.53
Reference: 74,942.11
Health & Fitness: 23,298.02
Weather: 52,279.89
Utilities: 18,684.46
Travel: 28,243.80
Shopping: 26,919.69
News: 21,248.02
Navigation: 86,090.33
Lifestyle: 16,485.76
Entertainment: 14,029.83
Food & Drink: 33,333.92
Sports: 23,008.90
Book: 39,758.50
Finance: 31,467.94
Education: 7,003.98
Productivity: 21,028.41
Business: 7,491.12
Catalogs: 4,004.00
Medical: 612.00


The output above shows the average number of user ratings for each genre. 
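
The nested loops rescan the whole dataset once per genre. As a sketch of an alternative, the same averages can be accumulated in a single pass with two dictionaries. The name `average_by_group` and the toy `sample` data are assumptions for illustration.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Return {group: mean of the numeric column} computed in a single pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        totals[row[group_idx]] += float(row[value_idx])
        counts[row[group_idx]] += 1
    return {group: totals[group] / counts[group] for group in totals}

sample = [['Games', '10'], ['Games', '30'], ['Music', '5']]
print(average_by_group(sample, 0, 1))  # → {'Games': 20.0, 'Music': 5.0}
```

For the Apple Store data, the equivalent call would be `average_by_group(free_eng_ios, 11, 5)`.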

While Games and Entertainment are the most common genres for apps, the more popular genres seem to be Navigation, Reference, and Social Networking.

In [25]:
for app in free_eng_ios:
    if app[11] == 'Reference': # prime_genre column
        print(app[1], ':', app[5]) # print name and number of ratings

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The average number of user ratings is heavily skewed by app giants like the Bible and Dictionary.com. A similar pattern appears in the Navigation and Social Networking genres, which are dominated by Google Maps and Waze, and by Facebook and Instagram, respectively.
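
One way to quantify the skew is to compare the mean with the median, which is far less sensitive to a few very large values. The sketch below uses the rating counts of the Reference apps listed in the output above; the median is suggested here only as a possible complement to the averages.

```python
from statistics import mean, median

# Rating counts for the Reference apps, copied from the output above
reference_ratings = [
    985920, 200047, 54175, 26786, 18418, 17588, 16849, 12122, 8535,
    4693, 1497, 826, 762, 718, 14, 8, 0, 0,
]

print(f"mean:   {mean(reference_ratings):,.2f}")    # → mean:   74,942.11
print(f"median: {median(reference_ratings):,.2f}")  # → median: 6,614.00
```

The mean matches the Reference average computed earlier, while the median shows that a typical Reference app has only a few thousand ratings.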

**Google Play**

We find that the `Installs` column does not contain precise install numbers; instead, it contains ranges such as `10,000+`.

We can work with the ranges by removing the unwanted characters and converting the result into float values.
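
As a minimal sketch, the cleaning step on its own could be written as a small helper. The name `parse_installs` is hypothetical.

```python
def parse_installs(installs):
    """Convert an install range such as '1,000,000+' to the float at its lower bound."""
    return float(installs.replace('+', '').replace(',', ''))

print(parse_installs('10,000+'))         # → 10000.0
print(parse_installs('1,000,000,000+'))  # → 1000000000.0
```

Since each value is the lower bound of its range, averages built from these numbers will tend to understate the true install counts.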

Let's combine the above with the same logic applied to the apple store dataset in the same loop. 

In [26]:
categories_android = freq_table(free_eng_android, 1)

for category in categories_android:
    
    total = 0
    len_category = 0
    
    for app in free_eng_android:
        
        category_app = app[1]
        
        if category == category_app:
            
            no_of_installs = app[5]
            no_of_installs = no_of_installs.replace('+','')
            no_of_installs = no_of_installs.replace(',','')
            total += float(no_of_installs)
            len_category += 1
            
    avg_installs = total / len_category
    print(f"{category}: {avg_installs:,.2f}")
    

ART_AND_DESIGN: 1,986,335.09
AUTO_AND_VEHICLES: 647,317.82
BEAUTY: 513,151.89
BOOKS_AND_REFERENCE: 8,767,811.89
BUSINESS: 1,712,290.15
COMICS: 817,657.27
COMMUNICATION: 38,456,119.17
DATING: 854,028.83
EDUCATION: 1,833,495.15
ENTERTAINMENT: 11,640,705.88
EVENTS: 253,542.22
FINANCE: 1,387,692.48
FOOD_AND_DRINK: 1,924,897.74
HEALTH_AND_FITNESS: 4,188,821.99
HOUSE_AND_HOME: 1,331,540.56
LIBRARIES_AND_DEMO: 638,503.73
LIFESTYLE: 1,437,816.27
GAME: 15,588,015.60
FAMILY: 3,695,641.82
MEDICAL: 120,550.62
SOCIAL: 23,253,652.13
SHOPPING: 7,036,877.31
PHOTOGRAPHY: 17,840,110.40
SPORTS: 3,638,640.14
TRAVEL_AND_LOCAL: 13,984,077.71
TOOLS: 10,801,391.30
PERSONALIZATION: 5,201,482.61
PRODUCTIVITY: 16,787,331.34
PARENTING: 542,603.62
WEATHER: 5,074,486.20
VIDEO_PLAYERS: 24,727,872.45
NEWS_AND_MAGAZINES: 9,549,178.47
MAPS_AND_NAVIGATION: 4,056,941.77


The output above shows the average number of installs for each category in the Google Play store.

It shows that Communication is the most popular category, averaging over 38 million installs per app.

However, we can infer that these numbers can be skewed by a few giants, similar to the Apple Store. 

In [27]:
for app in free_eng_android:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+' or app[5] == '5,000,000+'):
        print(app[0], ':', app[5]) # print name and number of installs

WhatsApp Messenger : 1,000,000,000+
My Tele2 : 5,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Full Screen Caller ID : 5,000,000+
CIA - Caller ID & Call Blocker : 5,000,000+
Call Control - Call Blocker : 5,000,000+
Sync.ME – Caller ID & Block : 5,000,000+
Gmail : 1,000,000,000+
K-9 Mail : 5,000,000+
Daum Mail - Next Mail : 5,000,000+
Hangouts : 1,000,000,000+
JusTalk - Free Video Calls and Fun Video Chat : 5,000,000+
AT&T Call Protect : 5,000,000+
Viber Messenger : 500,000,

Confirming our assumption, we see that apps like Gmail, WhatsApp and Skype skew the genre data.

## Conclusion

This project showcased an introductory process for collecting, cleaning and analyzing a subset of app data from the Apple and Google Play stores. 

The goal was to find what type of app profiles could be profitable for both markets. 

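On the surface, Games and Entertainment are the most common genres in the Apple Store subset, while the Google Play subset is more evenly spread, led by Family, Game, and Tools. By average ratings or installs, Navigation, Reference, Social Networking, and Communication apps are the most popular, although those averages are inflated by a handful of very large apps.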

One possible idea could be to create a mobile game that allows for interactions between users and offers in-app purchases for better items, increasing the likelihood that a user will spend to win. (This approach is already highly common, however.)

Another idea could be to combine Social Networking with something like a translator app. A repository of text translation could be at the ready for any user accessing that information. The more common translations could be analyzed by a community discussion to give confirmation on the accuracy of formal, informal or slang phrases.