# App Profiles Project

## Future App Development Strategy
<img src="https://www.apple.com/v/app-store/a/images/overview/icon_appstore__ev0z770zyxoy_large.png" width=90px alt="App Store Logo" title="App Store Logo" align='left'/>

<img src="https://www.notebookcheck.net/fileadmin/Notebooks/News/_nc3/20180131_Play_Store_Logo.png" alt="Play Store Logo" title="Play Store Logo" width=100px/>

#### Author: Frank Pereny
#### Date: November 19th, 2020

## Introduction:
### Project Summary:
The direction of our future development efforts is critical to profitability and success. The purpose of this project is to understand which apps are most likely to attract users on the Google Play Store and Apple App Store. To accomplish this goal, we will collect and analyze a set of available data. By aligning with prevailing market data and trends, we will minimize risk on new development efforts.

As of September 2018, there were approximately 2 million iOS apps and 2.1 million Android apps available. The Windows Store, Amazon Appstore and BlackBerry World have approximately 1.35 million apps available combined. Since Google and Apple dominate the market, and we see no reason for that to change in the near future, this analysis will focus on the Google Play Store and Apple App Store exclusively.

<img src="https://s3.amazonaws.com/dq-content/350/py1m8_statista.png" alt="Marketplace Statistics Chart" title="Marketplace Statistics" />


### Primary Goals:
The primary goals of this analysis are as follows:
- Focus on data that matches our company profile
    - Ad based revenue strategy
    - Free to download and install
    - Targeted towards English speakers
    
- Cleaning data
    - Removing duplicate data
    - Removing or correcting erroneous data
    
- Analysis:
    - Determining the number of apps available by genre
    - Estimating the popularity of each genre with end users
    - Recommending a genre target for our future app development
    

### Results:
Games are the dominant type of app on the Apple App Store, by both share of available apps and estimated downloads, and the most downloaded category on the Google Play Store.

We recommend developing game applications to maximize the chances of future success.

<img src="https://www.apple.com/v/app-store/a/images/overview/icon_appstore__ev0z770zyxoy_large.png" width=40px alt="App Store Logo" title="App Store Logo" align='left'/>

#### Apple App Store

##### Market Share by Genre
|App Store Genre| Market Share |
|--------------| -------------|
|Games | 58.16% |
|Entertainment | 7.88% |
|Photo & Video | 4.97%|
|Education | 3.66%|
|Social Networking | 3.29%|

##### Estimated Downloads by Genre
|App Store Genre| Estimated Downloads |
|--------------| -------------|
|Games | 53.39%|
|Social Networking | 9.48%|
|Photo & Video | 5.69%|
|Music | 4.73%|
|Entertainment | 4.46%|

<img src="https://www.notebookcheck.net/fileadmin/Notebooks/News/_nc3/20180131_Play_Store_Logo.png" alt="Play Store Logo" title="Play Store Logo" width=50px align='left'/>

#### Google Play Store  
##### Market Share by Genre
|Play Store Genre| Market Share |
|--------------| -------------|
|FAMILY | 18.91%|
|GAME | 9.73%|
|TOOLS | 8.46%|
|BUSINESS | 4.59%|
|LIFESTYLE | 3.9%|

##### Estimated Downloads by Genre
|Play Store Genre| Estimated Downloads |
|--------------| -------------|
|GAME | 17.86%|
|COMMUNICATION | 14.67%|
|TOOLS | 10.77%|
|FAMILY | 8.23%|
|PRODUCTIVITY | 7.7%| 

## Data Sets
### App Store
Data for approximately 7,000 apps on the App Store was collected in July, 2017 and stored in CSV format.
#### Data Format

|Column Header| Explanation | Index|
|-----------  | --------   | ------|
|id| App ID| 0|
|track_name| App Name| 1 |
|size_bytes| File Size| 2 |
|currency  | Currency | 3 |
|price     | App Price| 4 |
|rating_count_tot| Number of User Ratings (All Versions) | 5 |
|rating_count_ver| Number of User Ratings (Current Version) | 6 |
|user_rating| Average User Rating (All Versions) | 7 |
|user_rating_ver| Average User Rating (Current Version) | 8 |
|ver | App Version | 9 |
|cont_rating | Content Rating | 10 |
|prime_genre | App Genre | 11 |
|sup_devices.num | Number of Supported Devices | 12 |
|ipadSc_urls.num | Number of iPad Screenshots Displayed | 13 |
|lang.num | Number of Supported Languages | 14 |
|vpp_lic | VPP Device-Based Licensing Enabled | 15 |

#### Download
[Download CSV File](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

### Google Play Store
Data for approximately 10,000 apps on the Google Play Store was collected in August, 2018 and stored in CSV format.
#### Data Format
|Column Header| Explanation | Index|
|-----------  | --------   | ------|
|App| App Name| 0 |
|Category | App Genre | 1 |
|Rating| Average User Rating | 2|
|Reviews| Number of Reviews| 3 |
|Size| File Size| 4 |
|Installs| Number of Downloads| 5 |
|Type | Type | 6 |
|Price| App Price | 7 |
|Content Rating | Content Rating| 8|
|Genres| Genre Group | 9 |
|Last Updated| Date Last Updated | 10 |
|Current Ver| Current App Version| 11|
|Android Ver| Android Version | 12 |

#### Download
[Download CSV File](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv)

## Reading Data Sets

In [1]:
from csv import reader

# Source Data
app_store_file = 'AppleStore.csv'
google_play_file = 'googleplaystore.csv'

# Opening and reading each data set file into a list of rows
# (the encoding is set explicitly so the files load the same way on every platform)
with open(app_store_file, encoding='utf8') as opened_app_store_file:
    app_store_data = list(reader(opened_app_store_file))

with open(google_play_file, encoding='utf8') as opened_google_play_file:
    google_play_data = list(reader(opened_google_play_file))

## Exploring Data

### explore_data()
The explore_data() function can be used to quickly view a section of the data.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))
        print('\n')

In [3]:
explore_data(app_store_data, 0, 3, True)
explore_data(google_play_data, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows:  7198
Number of columns:  16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 

## Incorrect Data
### Google Play Store - Missing Column and Data Shift
The following error was reported in [this discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015). On review, the "Category" value appears to be missing from one row, so every later value in that row is shifted one column to the left.
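
A row like this can also be located without knowing its position in advance, by comparing each row's length with the header's. The sketch below illustrates the idea on a small made-up list; `find_malformed_rows` and the sample rows are hypothetical and not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(dataset[0])  # the header row defines the expected width
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]


# Hypothetical miniature data set: the last row is missing its Category value
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.2'],
    ['Shifted App', '1.9'],
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

Applied to `google_play_data`, the same check should flag the row discussed here, which holds 12 values against 13 header columns.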

In [4]:
explore_data(google_play_data, 0, 1, False)
explore_data(google_play_data, 10473, 10474, False)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




#### Action:
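The row at index 10473 (data row 10472 when the header is not counted) was deleted. The deletion was confirmed by printing the row that now occupies that index and comparing the total number of rows.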

In [5]:
del google_play_data[10473]
explore_data(google_play_data, 10473, 10474, True)

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


Number of rows:  10841
Number of columns:  13




## Duplicate Data

### duplicate_data()
The duplicate_data() function reports how many apps appear more than once, based on the app name. It returns a list of all unique app names and a dictionary mapping each duplicated name to its number of extra entries.

In [6]:
def duplicate_data(app_store, app_name_col):
    # List of unique app names seen so far
    app_store_unique = []
    # Dictionary mapping each duplicated name to its number of extra entries
    app_store_duplicates = {}

    # Loops over all data skipping the header row
    for app_data in app_store[1:]:
        name = app_data[app_name_col]

        # Checks if app is a duplicate (already exists in unique set)
        if name in app_store_unique:
            # Adds app or increments the dictionary
            if name in app_store_duplicates:
                app_store_duplicates[name] += 1
            else:
                app_store_duplicates[name] = 1
        else:
            app_store_unique.append(name)

    print("Number of unique apps in the store: " + str(len(app_store_unique)))
    print("Number of apps with at least one duplicate: " + str(len(app_store_duplicates)))
    return app_store_unique, app_store_duplicates
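
As an alternative sketch, the same duplicate counts can be obtained with `collections.Counter` from the standard library, which avoids repeated list membership checks. The helper name `count_duplicates` and the sample rows are hypothetical and only illustrate the idea.

```python
from collections import Counter


def count_duplicates(dataset, name_col):
    """Map each repeated app name to its number of extra occurrences."""
    counts = Counter(row[name_col] for row in dataset[1:])  # skip the header row
    return {name: n - 1 for name, n in counts.items() if n > 1}


# Hypothetical sample: a header row plus four entries
sample = [
    ['App', 'Reviews'],
    ['Alpha', '10'],
    ['Beta', '5'],
    ['Alpha', '12'],
    ['Alpha', '12'],
]

duplicates = count_duplicates(sample, 0)
print(duplicates)
```

As in `duplicate_data()`, the first occurrence of a name is not counted, so an app listed three times is reported with two duplicate entries.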

### Results
#### App Store

In [7]:
app_store_unique, app_store_duplicates = duplicate_data(app_store_data, 1)

Number of unique apps in the store: 7195
Number of apps with at least one duplicate: 2


Two potential duplicates were identified.  We will consider duplicates only if the App ID is the same (column 0 of the CSV file).

In [8]:
app_id = []
confirmed_duplicates = []
for row in app_store_data:
    if row[1] in app_store_duplicates:
        if row[0] in app_id:
            confirmed_duplicates.append(row)
        else:
            app_id.append(row[0])

print("There are", len(confirmed_duplicates), "confirmed duplicate entries.")
    

There are 0 confirmed duplicate entries.


#### Google Play Store

In [15]:
google_play_unique, google_play_duplicates = duplicate_data(google_play_data, 0)

Number of unique apps in the store: 9659
Number of apps with at least one duplicate: 798


798 potential duplicates were identified.  Since there is no App ID, we will consider all of these duplicates.

#### Action:
For each app, all duplicate entries except the most recent will be removed. We assume that the entry with the most reviews is the most recent one, and it will be preserved.

In [19]:
# Dictionary to contain the maximum number of reviews for each app
android_reviews_max = {}

for app_data in google_play_data[1:]:
    name = app_data[0]
    n_reviews = float(app_data[3])
    
    # Check if the app is in the dictionary
    if name in android_reviews_max:
        # Check if reviews is greater than current value in dictionary
        if android_reviews_max[name] < n_reviews:
            android_reviews_max[name] = n_reviews
    else:
        # If not in dictionary, create a new key and assign reviews
        android_reviews_max[name] = n_reviews
        
android_clean = []
already_added = []

for app_data in google_play_data[1:]:
    name = app_data[0]
    n_reviews = float(app_data[3])
    
    if n_reviews == android_reviews_max[name] and name not in already_added:
        android_clean.append(app_data)
        already_added.append(name)
        
print('Number of apps in cleaned Google Play Store data:', len(android_clean))

Number of apps in cleaned Google Play Store data: 9659
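
The same rule (keep one entry per app, the one with the most reviews) can also be sketched in a single pass with one dictionary. This is an illustrative variant on made-up rows, not the notebook's method; `keep_most_reviewed` and its default column indexes are assumptions that mirror the Google Play layout described earlier.

```python
def keep_most_reviewed(rows, name_col=0, reviews_col=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_col]
        reviews = float(row[reviews_col])
        # Strict '>' keeps the earliest row when review counts tie
        if name not in best or reviews > float(best[name][reviews_col]):
            best[name] = row
    return list(best.values())


# Hypothetical rows without a header: name, category, rating, reviews
sample_rows = [
    ['Alpha', 'TOOLS', '4.1', '100'],
    ['Beta', 'GAME', '4.5', '50'],
    ['Alpha', 'TOOLS', '4.1', '250'],
    ['Alpha', 'TOOLS', '4.1', '250'],
]

cleaned = keep_most_reviewed(sample_rows)
print('Number of apps after cleaning:', len(cleaned))
```

The result contains the same apps as the two-pass approach, although the row order may differ because each app keeps the position of its first appearance.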


## Remove Non-English Apps
Because our app is targeted at the English-speaking world, we will remove data for apps aimed primarily at non-English-speaking users.
### Method
In order to identify non-English apps, we will attempt to quantify the likelihood that an app is targeted at English speakers.

We will use the following method:
1. Identify non-English characters as any character that falls outside of the range of ASCII characters (0 through 127)
2. Determine the number of non-ASCII characters in the name of the app
3. If the app has more than 3 non-ASCII characters it will be removed from the data set.

In [11]:
def eng_app(string):
    bad_char = 0
    for char in string:
        if ord(char) > 127:
            bad_char += 1
    if bad_char > 3:
        return False
    else:
        return True
 
apple_english = []
android_english = []

for app_data in app_store_data[1:]:
    if eng_app(app_data[1]):
        apple_english.append(app_data)

for app_data in android_clean[1:]:
    if eng_app(app_data[0]):
        android_english.append(app_data)
        
print("App store English apps: " + str(len(apple_english)))      
print("Google Play English apps: " + str(len(android_english)))

App store English apps: 6183
Google Play English apps: 9613


## Selecting Free Apps
Since the primary business model is to provide free apps supported by ad revenue, we will remove any apps with a price greater than 0.

In [12]:
free_apple_apps = []
free_google_apps = []

for app_data in apple_english:
    # App Store prices are plain numeric strings such as '0.0'
    price = float(app_data[4])
    if price == 0:
        free_apple_apps.append(app_data)

for app_data in android_english:
    # Google Play prices may carry a leading '$', which float() cannot parse
    price = float(app_data[7].replace('$', ''))
    if price == 0:
        free_google_apps.append(app_data)
        
print("Apple store FREE apps: " + str(len(free_apple_apps)))
print("Google Play FREE apps: " + str(len(free_google_apps)))

Apple store FREE apps: 3222
Google Play FREE apps: 8863
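
The free-app test depends on turning a price string into a number. A minimal sketch of that conversion follows, assuming prices appear either as plain numbers (as in the sample rows shown earlier) or with a leading dollar sign; `parse_price` and the sample values are hypothetical.

```python
def parse_price(raw):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    # Assumption: the only non-numeric decoration is an optional leading '$'
    return float(raw.strip().lstrip('$'))


sample_prices = ['0', '0.0', '$4.99']
parsed = [parse_price(price) for price in sample_prices]
free_flags = [price == 0 for price in parsed]

for raw, is_free in zip(sample_prices, free_flags):
    print(raw, ':', 'free' if is_free else 'paid')
```

Keeping the decimal point intact means the parsed value is the actual price, which matters if the data is later filtered by a price threshold rather than only by zero.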


## Data Analysis
### Most Common Genres
From our analysis below, we find that Games is the dominant genre on the Apple App Store, accounting for more than 50% of the free English apps in the data set. On the Google Play Store, Family and Game are the two most common categories, with a combined share of almost 29%.

In [17]:
def freq_table(data_set, col):
    table = {}
    count = 0
    for data in data_set:
        count += 1
        if data[col] not in table:
            table[data[col]] = 1 
        else:
            table[data[col]] += 1
    for key in table:
        table[key] = round(table[key] / count * 100.0, 2)
    return table

apple_genre_freq_table = freq_table(free_apple_apps, 11)
android_genre_freq_table = freq_table(free_google_apps, 1)

def display_table(data_set, col, percent=False):
    table = freq_table(data_set, col)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse=True)
    if percent:
        for entry in table_sorted:
            print(entry[1], ':', str(entry[0])+'%')
    else:
        for entry in table_sorted:
            print(entry[1], ':', entry[0])
        
print('App Store Apps')
display_table(free_apple_apps, 11, percent=True)
print('\nGoogle Play Store Apps')
display_table(free_google_apps, 1, percent=True)

App Store Apps
Games : 58.16%
Entertainment : 7.88%
Photo & Video : 4.97%
Education : 3.66%
Social Networking : 3.29%
Shopping : 2.61%
Utilities : 2.51%
Sports : 2.14%
Music : 2.05%
Health & Fitness : 2.02%
Productivity : 1.74%
Lifestyle : 1.58%
News : 1.33%
Travel : 1.24%
Finance : 1.12%
Weather : 0.87%
Food & Drink : 0.81%
Reference : 0.56%
Business : 0.53%
Book : 0.43%
Navigation : 0.19%
Medical : 0.19%
Catalogs : 0.12%

Google Play Store Apps
FAMILY : 18.91%
GAME : 9.73%
TOOLS : 8.46%
BUSINESS : 4.59%
LIFESTYLE : 3.9%
PRODUCTIVITY : 3.89%
FINANCE : 3.7%
MEDICAL : 3.53%
SPORTS : 3.4%
PERSONALIZATION : 3.32%
COMMUNICATION : 3.24%
HEALTH_AND_FITNESS : 3.08%
PHOTOGRAPHY : 2.94%
NEWS_AND_MAGAZINES : 2.8%
SOCIAL : 2.66%
TRAVEL_AND_LOCAL : 2.34%
SHOPPING : 2.25%
BOOKS_AND_REFERENCE : 2.14%
DATING : 1.86%
VIDEO_PLAYERS : 1.79%
MAPS_AND_NAVIGATION : 1.4%
FOOD_AND_DRINK : 1.24%
EDUCATION : 1.16%
ENTERTAINMENT : 0.96%
LIBRARIES_AND_DEMO : 0.94%
AUTO_AND_VEHICLES : 0.93%
HOUSE_AND_HOME : 0.82%
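
For comparison, a percentage frequency table like the ones above can be sketched with `collections.Counter`, whose `most_common()` method already returns entries sorted by count. The helper `percentage_table` and the sample rows are hypothetical and serve only as an illustration.

```python
from collections import Counter


def percentage_table(rows, col):
    """Return (value, percent) pairs for column `col`, most frequent first."""
    counts = Counter(row[col] for row in rows)
    total = sum(counts.values())
    return [(value, round(n / total * 100, 2)) for value, n in counts.most_common()]


# Hypothetical rows: app name and genre
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Music'],
    ['App D', 'Games'],
]

table = percentage_table(sample_rows, 1)
for genre, percent in table:
    print(genre, ':', str(percent) + '%')
```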

## Genre Popularity
Game apps are the most popular on the Apple App Store, with more than 50% of estimated downloads. The App Store data has no install counts, so downloads are estimated from total user-rating counts. Similarly, Game is the most downloaded category on the Google Play Store, with about 18% of installs.
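
The popularity figures are shares of a summed count rather than shares of the number of apps. The toy example below sketches that arithmetic under the assumption that install values look like `'10,000+'`, as in the sample rows shown earlier; `parse_installs`, `install_share` and the rows are hypothetical.

```python
def parse_installs(raw):
    """Turn an install bucket such as '10,000+' into the integer 10000."""
    return int(raw.replace(',', '').replace('+', ''))


def install_share(rows, genre_col, installs_col):
    """Return each genre's percentage of the total installs."""
    totals = {}
    for row in rows:
        genre = row[genre_col]
        totals[genre] = totals.get(genre, 0) + parse_installs(row[installs_col])
    grand_total = sum(totals.values())
    return {genre: round(n / grand_total * 100, 2) for genre, n in totals.items()}


# Hypothetical rows: app name, genre, installs
sample_rows = [
    ['App A', 'GAME', '10,000+'],
    ['App B', 'GAME', '5,000+'],
    ['App C', 'TOOLS', '5,000+'],
]

shares = install_share(sample_rows, 1, 2)
print(shares)
```

Here GAME accounts for 15,000 of 20,000 installs, or 75%. Since values such as `'10,000+'` are lower bounds, the resulting shares are best read as estimates. Plain digit strings, such as the App Store rating counts, pass through `parse_installs` unchanged.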

In [14]:
def genre_table(data_set, genre, installs):
    total_installs = 0   
    table = {}
    for data in data_set:
        genre_val = data[genre]
        
        installs_val = data[installs]
        installs_val = ''.join(x for x in installs_val if x.isnumeric())
        installs_val = int(installs_val)
        total_installs += installs_val
        
        if genre_val not in table:
            table[genre_val] = installs_val
        else:
            table[genre_val] += installs_val
            
    for key in table:
        table[key] = round(table[key] / float(total_installs) * 100,2)
    return table

def sort_dict(dictionary, percent=True):
    new_dict = []
    for key in dictionary:
        key_val_as_tuple = (dictionary[key], key)
        new_dict.append(key_val_as_tuple)
        
    sorted_dict = sorted(new_dict, reverse=True)
    if percent:
        for entry in sorted_dict:
            print(entry[1], ':', str(entry[0])+'%')
    else:
        for entry in sorted_dict:
            print(entry[1], ':', entry[0])
            
print('App Store Estimated Installs (Based on Rating Count):')
apple_genre_data = genre_table(free_apple_apps, 11, 5)
sort_dict(apple_genre_data)            
print()
print('Google Play Store Installs By Genre:')            
android_genre_data = genre_table(free_google_apps, 1, 5)
sort_dict(android_genre_data)

App Store Estimated Installs (Based on Rating Count):
Games : 53.39%
Social Networking : 9.48%
Photo & Video : 5.69%
Music : 4.73%
Entertainment : 4.46%
Shopping : 2.83%
Sports : 1.98%
Utilities : 1.89%
Health & Fitness : 1.89%
Weather : 1.83%
Reference : 1.69%
Productivity : 1.47%
Finance : 1.42%
Travel : 1.41%
News : 1.14%
Food & Drink : 1.08%
Lifestyle : 1.05%
Education : 1.03%
Book : 0.7%
Navigation : 0.65%
Business : 0.16%
Catalogs : 0.02%
Medical : 0.0%

Google Play Store Installs By Genre:
GAME : 17.86%
COMMUNICATION : 14.67%
TOOLS : 10.77%
FAMILY : 8.23%
PRODUCTIVITY : 7.7%
SOCIAL : 7.29%
PHOTOGRAPHY : 6.19%
VIDEO_PLAYERS : 5.22%
TRAVEL_AND_LOCAL : 3.85%
NEWS_AND_MAGAZINES : 3.15%
BOOKS_AND_REFERENCE : 2.21%
PERSONALIZATION : 2.03%
SHOPPING : 1.86%
HEALTH_AND_FITNESS : 1.52%
SPORTS : 1.46%
ENTERTAINMENT : 1.31%
BUSINESS : 0.93%
MAPS_AND_NAVIGATION : 0.67%
LIFESTYLE : 0.66%
FINANCE : 0.6%
WEATHER : 0.48%
FOOD_AND_DRINK : 0.28%
EDUCATION : 0.25%
DATING : 0.19%
ART_AND_DESIGN : 0.