<style>
.text_cell_render {
font-family: Times New Roman, serif;
}
</style>

In [77]:
!jt -t grade3

# Finding Profitable App Profiles for the Apple App Store and Google Play Markets


             
####  The goal of this project is to help our developers (at our imaginary company) understand what apps are likely to attract the most users. At our company, we only build apps that are free to download and install, and that are directed toward an English-speaking audience. Our revenue is highly influenced by the number of people using our apps.

### Business Strategy
#### Our validation strategy is based on our overall business strategy: 
1. ##### Build a minimal Android version of the app, and add it to Google Play.
2. ##### If the app has a good response from users, we develop it further.
3. ##### If the app is profitable after six months, we build an iOS version of the app and add it to the Apple App Store.

### Methodology 
#### After we cleaned the data, we found the percentage of apps that existed within each genre for both the App Store and Google Play Store. Then we drilled down to the user ratings and reviews to determine customer preferences. 

### Results Summary
#### We recommend that our developers prioritize general, practical, and/or educational applications for the Google Play Store Market. If these are successful after six months, launch them on the Apple App Store.  If management wants to grow the market, and if resources exist, they should consider re-evaluating the business strategy to launch entertainment and gaming apps directly to the Apple App Store.  
#### For more details, please refer to the full analysis below. 
             
        

# Exploring a Sampling of Existing Data Sets
####  As of September 2018, there were approximately 2 million iOS apps available on the Apple App Store and 2.1 million Android apps on Google Play. 

- Statista: https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/

#### Collecting data for over 4 million apps was not practical for this project, and subsets of the data already existed on kaggle.com Here are the datasets for Apple iOS App Store and Google Play Store that were used for this analysis: 
- https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps
- https://www.kaggle.com/lava18/google-play-store-apps

#### The App Store Data was collected in July 2017 from an API. Version 6 of the Google Play Store Data was scraped in February 2019.

In [78]:
# Open both data sets and save as list of lists
from csv import reader
def file_import(file):   
    open_file = open(file, encoding = 'utf8')
    read_file = reader(open_file)
    return list(read_file)
    
apple = file_import('AppleStore.csv')
google = file_import('googleplaystore.csv')

In [79]:
# Explore both data sets 
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

explore_data(apple, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


In [80]:
# Apple App Store columns
apple[0]

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [81]:
explore_data(google, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


In [82]:
# Google Store Columns
google[0]

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

# Determining the Relevant Information from the Data
#### There are 7198 rows (including the header) and 16 columns in the Apple App Store Data. Since we're interested in determining the most popular genres within the free apps, we focused on the following columns: 
* #### "track_name" (App Name)
* #### "price"
* #### "ratingcounttot" (User Rating counts (for all version))
* #### "prime_genre": (Primary Genre)
#### The Google Play Store Data contained 10842 rows (including the header) and 13 columns. We focused on the following columns: 
* #### "App": App Name
* #### "Category"
* #### "Reviews"
* #### "Installs"
* #### "Price" 
* #### "Genres"

    

# Handling Inaccuracies, Duplicates, Removing Non-English and Not-Free Apps. 
#### There is a wrong rating for Google Play data set for entry 10472 causing the columns to shift according to a discussion on Kaggle: https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015. The column is shifted due to missing Category field, so we deleted it.  
#### There were also 1181 duplicate apps in the Google Play Store. Since we were interested in the number of current reviews, we kept the ones with the highest ratings (since they were the most current versions), and dropped the rest. 

#### The App Store data did not contain inaccuracies or duplicates that impacted our data. 
#### Both data sets contained non-English and not-free apps, so we removed those and created clean datasets. 

In [83]:
# Find row with wrong data
explore_data(google, 10473, 10474, True)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Number of rows: 10842
Number of columns: 13


In [84]:
del google[10473]

In [85]:
# Count the number of duplicates vs. unique apps. 
# Print an example of one of the duplicates
duplicate_apps = []
unique_apps = []
for app in google[1:]: 
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else: 
        unique_apps.append(name)
print('Number of duplicate apps in Google Play:', len(duplicate_apps))
print('Number of unique apps: Google Play', len(unique_apps))
print('Example of duplicate apps in Google Play:', duplicate_apps[0:2])

Number of duplicate apps in Google Play: 1181
Number of unique apps: Google Play 9659
Example of duplicate apps in Google Play: ['Quick PDF Scanner + OCR FREE', 'Box']


In [86]:
# 'Quick PDF Scanner + OCR FREE' is a duplicate app. The 4th entry is the number of reviews. 
for app in google[1:]:
    name = app[0]
    if name == 'Quick PDF Scanner + OCR FREE':
        print(app)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


In [87]:
# Find initial length and determine the expected lenght after removing duplicates
len(google[1:]) -1181

9659

In [88]:
# Create a dictionary where key = unique app, value = highest review
# Make sure the number of entries matches what was expected (9659)
reviews_max_goog = {}
for row in google[1:]:
    name = row[0]
    n_reviews = int(row[3])
    if name in reviews_max_goog and reviews_max_goog[name] < n_reviews: 
        reviews_max_goog[name] += n_reviews
    elif name not in reviews_max_goog:
        reviews_max_goog[name] = n_reviews
print('The number of entries:', len(reviews_max_goog),'matches what was expected.')

The number of entries: 9659 matches what was expected.


In [89]:
# Revove duplicates
google_clean = []
already_added = []
for row in google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == (reviews_max_goog[name]) and (name not in already_added):
        google_clean.append(row)
    else: 
        already_added.append(row)   

In [90]:
print('Number of rows of cleanded data:', len(google_clean))

Number of rows of cleanded data: 9667


In [91]:
print('Number of rows of duplicates for checking purposes:', len(already_added))

Number of rows of duplicates for checking purposes: 1173


In [92]:
# Function for removing more than 3 non-english characters in the name of the app. 
# Takes a string as an argument and returns 'True' if english or 'False' if not.
def is_english(string):
    count = 0
    for char in string:        
        if ord(char) > 127:
            count += 1
            if count > 4: 
                return False           
    else: 
        return True
# Test with the following strings
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


In [93]:
# Use function 'is_english' on both data sets If app passes as English, append to a list.
# Determine the number of rows remaining in each data set. 
apple_english = []
for row in apple[1:]:
    name = row[1]
    if is_english(name) == True:
        apple_english.append(name)
        
google_english = []
for row in google_clean:
    name = row[0]
    if is_english(name) == True:
        google_english.append(name)  
        
print('Number of Apple apps before removing non-English apps:', len(apple[1:]))
print('Number of Apple apps after removing non-English apps:', len(apple_english))
print('Number of Apple apps removed:', len(apple[1:])-len(apple_english))

print('Number of Google apps before removing non-English apps:', len(google_clean[1:]))
print('Number of Google apps after removing non-English apps:', len(google_english))
print('Number of Google apps removed:', len(google_clean[1:])-len(google_english))

Number of Apple apps before removing non-English apps: 7197
Number of Apple apps after removing non-English apps: 6240
Number of Apple apps removed: 957
Number of Google apps before removing non-English apps: 9666
Number of Google apps after removing non-English apps: 9627
Number of Google apps removed: 39


In [94]:
#Isolate the free apps by retaining them in a list
free_apple = []
for row in apple[1:]:
    price = float(row[4])
    app = row[1]
    if app in apple_english and price == 0:
        free_apple.append(row)  
        
free_google = []
for row in google[1:]:
    price = row[6]
    app = row[0]
    if app in google_english and price == 'Free':
        free_google.append(row) 
        
print('Number of Apple apps remaining that are free:', len(free_apple))
print('Number of Google apps remaining that are free:', len(free_google))

Number of Apple apps remaining that are free: 3263
Number of Google apps remaining that are free: 9026


# Finding the Most Common Genres for Each Market
#### Our analysis found that 57.73% of the App Store apps fall under the 'Games' genre. 'Entertainment' was a far second at around 7.97%.  From the volume of gaming apps, it appears that Apple is catering to that market. 
#### Google apps were geared toward, practicality, family and education. We used both the 'Category' and the 'Genres' fields for our analysis. We found their apps more evenly distributed with no particular category or genre standing out.  8.45% of the Google Play apps fall under the 'FAMILY' category. 'GAME' was half that at 9.00%. 8.36% fall under the 'Tools' genre. 'Entertainment' was 6.13%.    

In [95]:
# The frequency table function takes two inputs: dataset and index, for any column
# the display table function prettifies the frequence table to show the percentages per genre

def freq_table(dataset, index):
    app_freq = {}
    for row in dataset:
        genre = row[index]
        if genre in app_freq: 
            app_freq[genre] += 1
        else:
            app_freq[genre] = 1
    total_number_of_apps = len(dataset)
    total_number_of_apps_times_100 = .01
    
    for value in app_freq: #convert frequency table to percentages
        app_freq[value]/=total_number_of_apps #'/=' is the iterable
    return (app_freq)
    
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

apple_prime_genre = display_table(free_apple, 11)

Games : 0.5773827765859638
Entertainment : 0.0796812749003984
Photo & Video : 0.04934109714986209
Education : 0.03616304014710389
Social Networking : 0.03279190928593319
Shopping : 0.026662580447441006
Utilities : 0.02513024823781796
Sports : 0.021146184492798037
Music : 0.02022678516702421
Health & Fitness : 0.0199203187250996
Productivity : 0.017775053631627336
Lifestyle : 0.01624272142200429
News : 0.013178057002758198
Travel : 0.012871590560833588
Finance : 0.01256512411890898
Weather : 0.008887526815813668
Food & Drink : 0.008887526815813668
Reference : 0.0055163959546429666
Business : 0.0055163959546429666
Book : 0.004596996628869139
Navigation : 0.002451731535396874
Medical : 0.0018387986515476554
Catalogs : 0.001225865767698437


In [96]:
google_Catagories = display_table(free_google, 1) # Catagories


FAMILY : 0.18446709505871925
GAME : 0.09007312209173499
TOOLS : 0.08375803235098604
BUSINESS : 0.04797252382007534
PRODUCTIVITY : 0.04165743407932639
LIFESTYLE : 0.03944161311766009
MEDICAL : 0.03833370263682694
FINANCE : 0.035785508530910705
SPORTS : 0.0335696875692444
PERSONALIZATION : 0.0335696875692444
HEALTH_AND_FITNESS : 0.032018612896078
COMMUNICATION : 0.030024374030578328
NEWS_AND_MAGAZINES : 0.029692000886328385
PHOTOGRAPHY : 0.028805672501661866
SOCIAL : 0.0268114336361622
SHOPPING : 0.026257478395745625
TRAVEL_AND_LOCAL : 0.024041657434079326
BOOKS_AND_REFERENCE : 0.020385552847329937
DATING : 0.019499224462663417
VIDEO_PLAYERS : 0.018058940837580324
MAPS_AND_NAVIGATION : 0.014513627298914247
EDUCATION : 0.012851761577664525
ENTERTAINMENT : 0.011522269000664746
FOOD_AND_DRINK : 0.011300686904498116
LIBRARIES_AND_DEMO : 0.009084865942831819
AUTO_AND_VEHICLES : 0.009084865942831819
HOUSE_AND_HOME : 0.008309328606248615
WEATHER : 0.007976955461998671
EVENTS : 0.006979836029248

In [97]:
google_Genres = display_table(free_google, 9) # Genres

Tools : 0.08364724130290273
Entertainment : 0.06137824063815644
Education : 0.053844449368491025
Business : 0.04797252382007534
Productivity : 0.04165743407932639
Lifestyle : 0.039330822069576776
Medical : 0.03833370263682694
Finance : 0.035785508530910705
Sports : 0.03390206071349435
Personalization : 0.0335696875692444
Health & Fitness : 0.032018612896078
Communication : 0.030024374030578328
News & Magazines : 0.029692000886328385
Action : 0.029248836693995126
Photography : 0.028805672501661866
Social : 0.0268114336361622
Shopping : 0.026257478395745625
Travel & Local : 0.02393086638599601
Simulation : 0.020385552847329937
Books & Reference : 0.020385552847329937
Dating : 0.019499224462663417
Video Players & Editors : 0.01794814978949701
Arcade : 0.017172612452913804
Casual : 0.016397075116330602
Maps & Navigation : 0.014513627298914247
Food & Drink : 0.011300686904498116
Puzzle : 0.010192776423664968
Racing : 0.009195656990915135
Libraries & Demo : 0.009084865942831819
Auto & Vehicl

# Finding the Most Popular Genres with Users
#### In the previous steps, we determined the percentage of apps that existed within each genre for both the App Store and Google Play Store. Then we drilled down to the user ratings and reviews to determine customer preferences. 
#### We found that for the App Store Social Networking apps had the highest user ratings, followed by 'Photo & Video', 'Games', and Music. 
#### By contrast, the Google Play Store's 'Art and Design' genre was the most popular followed by 'Auto and Vehicles, 'Beauty', Books and Business.

In [98]:
# Generate a tuple of unique genres for the prime_genre column using the freq_table function
apple_freq_table = freq_table(free_apple, 11)
unique_apps = dict.keys(apple_freq_table)
print(unique_apps)

dict_keys(['Social Networking', 'Photo & Video', 'Games', 'Music', 'Reference', 'Health & Fitness', 'Weather', 'Utilities', 'Travel', 'Shopping', 'News', 'Navigation', 'Lifestyle', 'Entertainment', 'Food & Drink', 'Sports', 'Book', 'Finance', 'Education', 'Productivity', 'Business', 'Catalogs', 'Medical'])


In [99]:
# Loop over the unique genres and compute the average number of user ratings
ave_ratings_per_apple_app = {} 
for genre in unique_apps: 
    total = 0
    len_genre = 0
    for row in free_apple:
        genre_app = row[11]
        if genre_app == genre:
            rating = float(row[5])
            total += rating
            len_genre += 1
    ave_rating = total/len_genre
    ave_ratings_per_apple_app[genre] = ave_rating

ave_ratings_per_apple_app            

{'Social Networking': 70884.73831775702,
 'Photo & Video': 28264.888198757762,
 'Games': 22667.712845010617,
 'Music': 57326.530303030304,
 'Reference': 74942.11111111111,
 'Health & Fitness': 23298.015384615384,
 'Weather': 50477.137931034486,
 'Utilities': 18460.353658536584,
 'Travel': 26925.166666666668,
 'Shopping': 25996.32183908046,
 'News': 21248.023255813954,
 'Navigation': 64667.375,
 'Lifestyle': 15863.77358490566,
 'Entertainment': 13727.292307692307,
 'Food & Drink': 29885.758620689656,
 'Sports': 23008.898550724636,
 'Book': 37217.73333333333,
 'Finance': 27638.243902439026,
 'Education': 7003.983050847458,
 'Productivity': 20360.241379310344,
 'Business': 7075.333333333333,
 'Catalogs': 4004.0,
 'Medical': 612.0}

In [100]:
# Repeat steps above for the Google Play Store
google_freq_table = freq_table(free_google, 1)
unique_g_apps = dict.keys(google_freq_table)
print(unique_g_apps)

dict_keys(['ART_AND_DESIGN', 'AUTO_AND_VEHICLES', 'BEAUTY', 'BOOKS_AND_REFERENCE', 'BUSINESS', 'COMICS', 'COMMUNICATION', 'DATING', 'EDUCATION', 'ENTERTAINMENT', 'EVENTS', 'FINANCE', 'FOOD_AND_DRINK', 'HEALTH_AND_FITNESS', 'HOUSE_AND_HOME', 'LIBRARIES_AND_DEMO', 'LIFESTYLE', 'GAME', 'FAMILY', 'MEDICAL', 'SOCIAL', 'SHOPPING', 'PHOTOGRAPHY', 'SPORTS', 'TRAVEL_AND_LOCAL', 'TOOLS', 'PERSONALIZATION', 'PRODUCTIVITY', 'PARENTING', 'WEATHER', 'VIDEO_PLAYERS', 'NEWS_AND_MAGAZINES', 'MAPS_AND_NAVIGATION'])


In [101]:
# Repeat steps above for the Google Play Store
# Clean the'Installs' field because it's in string format with "+" and ","
ave_installs_per_google_app = {}
for key in unique_g_apps:
    category = key
    total = 0
    len_category = 0
    for row in free_google:
        category_app = row[1]
        if category_app == category:
            installs = (row[5])
            installs = installs.replace('+', '') 
            installs = installs.replace(',', '')
            installs = int(installs)
            total += installs
            len_category += 1
    ave_installs = total/len_category
    ave_installs_per_google_app[key] = ave_installs
    
ave_installs_per_google_app      

{'ART_AND_DESIGN': 1843233.9285714286,
 'AUTO_AND_VEHICLES': 647317.8170731707,
 'BEAUTY': 513151.88679245283,
 'BOOKS_AND_REFERENCE': 7719534.021739131,
 'BUSINESS': 2005663.0254041571,
 'COMICS': 664042.1568627451,
 'COMMUNICATION': 36977513.65682657,
 'DATING': 590849.1875,
 'EDUCATION': 1959913.7931034483,
 'ENTERTAINMENT': 11244807.692307692,
 'EVENTS': 253542.22222222222,
 'FINANCE': 846635.0835913313,
 'FOOD_AND_DRINK': 1829791.6764705882,
 'HEALTH_AND_FITNESS': 3839094.8166089966,
 'HOUSE_AND_HOME': 1502699.48,
 'LIBRARIES_AND_DEMO': 524339.1463414634,
 'LIFESTYLE': 1430658.5084269664,
 'GAME': 9651745.940959409,
 'FAMILY': 2904742.756756757,
 'MEDICAL': 144183.2485549133,
 'SOCIAL': 29453974.801652893,
 'SHOPPING': 8881175.464135021,
 'PHOTOGRAPHY': 13304880.057692308,
 'SPORTS': 3583269.580858086,
 'TRAVEL_AND_LOCAL': 4731355.235023041,
 'TOOLS': 14695824.701058201,
 'PERSONALIZATION': 7492527.683168317,
 'PRODUCTIVITY': 20597418.38829787,
 'PARENTING': 542603.6206896552,
 'W

# Conclusion
#### In this project, we analyzed samples of data from the Apple App Store and Google Play Store to develop an apps profile that our developers could use to target certain markets and audiences. We focused on free, English language apps since that was our business criteria. The Apps Store was highly composed of 'fun' applications; particularly gaming, which was over 50% of their content. Their customers seemed to prefer the 'Social Networking', 'Photos & Videos' and 'Gaming' apps that they offered.   
#### By contrast, the Google Play Store content was much broader and didn't have a dominating category or genre. Their apps were more focused on practical tools and applications, family, and education. Their customers seemed to prefer practical, and educational apps they offered.
#### Based on our findings, we recommend that our developers prioritize general, practical, and/or educational applications for the Google Play Store Market. If these are successful after six months, launch them on the Apple App Store.  If management wants to expand the market, and if resources exist, they should consider re-evaluating the business strategy to launch entertainment and gaming apps directly to the Apple App Store.  