# App Usage Analysis - Which type of apps do users gravitate towards?

In this project, we analyze usage statistics of our apps to figure out which types of apps attract the most users. The goal is to share this information with the development team so the company can build more apps like the most popular ones, attract more users, and earn more ad revenue.

### Google Play Store Dataset:
> https://www.kaggle.com/lava18/google-play-store-apps/home

### Apple App Store Dataset:
> https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home

In [2]:
# Function for exploring our datasets
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# bring out the reader function from the csv module
from csv import reader

# Open the google dataset and read it into a list
opened_google_file = open("google-play-store-apps/googleplaystore.csv")
read_google_file = reader(opened_google_file)
google_data_set = list(read_google_file)

# open the apple dataset and read it into a list
opened_apple_file = open("app-store-apple-data-set-10k-apps/AppleStore.csv")
read_apple_file = reader(opened_apple_file)
apple_data_set = list(read_apple_file)


In [4]:
# explore the google dataset
explore_data(google_data_set, 0, 5, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


In [5]:
# explore the apple dataset
explore_data(apple_data_set, 0, 5, rows_and_columns=True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows: 7198
Number of columns: 17


# Columns of Google App Data Worth Exploring for most popular Google apps per genre

To explore which genres are most popular in the Google App dataset, we want to pull the Genres column together with the Installs column to show how many downloads there were per genre. In addition, we can look at the average rating per genre to get an idea of which genre is the highest rated.

- Category
- Rating
- Installs
- Genres

For details on what each column name means, please go to the following link:
> https://www.kaggle.com/lava18/google-play-store-apps
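
Since the rest of the notebook refers to columns by position (for example `app[5]` for Installs), it can help to look those positions up by name first. The following is a minimal sketch; `column_indexes` is a hypothetical helper that is not part of the original notebook, and the header row is copied from the output printed earlier.

```python
# Header row copied from the Google Play Store output shown earlier in this notebook.
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']


def column_indexes(header, wanted):
    """Map each wanted column name to its position in the header row.

    Hypothetical helper for illustration; raises ValueError if a name is missing.
    """
    return {name: header.index(name) for name in wanted}


google_columns = column_indexes(google_header, ['Category', 'Rating', 'Installs', 'Genres'])
print(google_columns)
```

With this mapping, `row[google_columns['Installs']]` reads the same value as `row[5]`, but the intent is easier to follow.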

# Columns of Apple App Store Data Worth Exploring for most downloads per genre

To explore which genres are most popular in the Apple AppStore dataset, we would want to pull the prime_genre column in addition to the rating_count_tot and user_rating columns to help us find the average rating per genre.

- rating_count_tot
- user_rating
- prime_genre

# Data Cleansing Pt. 1 - Outliers and Duplicates

Before getting too far into analyzing the data, let's make sure we do some data cleansing first.

In [6]:
# print the header for our reference
print(google_data_set[0])
# From the kaggle discussion board, a user pointed out the following row has an issue:
print(google_data_set[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


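The row above is missing its Category value (it has 12 values instead of 13), so every later field is shifted one column to the left and the Rating column reads "19". That is impossible, as the maximum rating in this dataset is 5. To avoid skewing our results, let's delete this row.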
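
A shifted row like this one can also be found without relying on the discussion board, by comparing each row's length with the header's length. This is only a sketch under that assumption; `find_malformed_rows` is a hypothetical helper and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (row index, row length) for rows whose length differs from the header's.

    Assumes dataset[0] is the header row, as in this notebook.
    """
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]


# Made-up sample: the last row is missing its Category value.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good app', 'TOOLS', '4.2'],
    ['Shifted app', '1.9'],
]
print(find_malformed_rows(sample))
```

Running `find_malformed_rows(google_data_set)` before any deletion would be expected to flag the row discussed above, assuming it is the only row with a missing field.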

In [7]:
# delete the invalid row
del google_data_set[10473]
# show that the row has been deleted now
print(google_data_set[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


Another discussion item on the Kaggle page points out that there are duplicates in the Google Play Store dataset. Let's dig in and find out.

In [8]:
# create a couple of lists for storing app names and duplicates
duplicate_apps = []
unique_apps = []

# create a for loop going through each of the apps in the Google Play store
for app in google_data_set:
    # assign a variable to the app name
    name = app[0]
    
    # if the app is in the unique_apps list already, add it to duplicate_apps
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
# print a few of the duplicate apps
print(duplicate_apps[0:5])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


In [9]:
# let's look at the total number of duplicate apps
print(len(duplicate_apps))

1181


Now we just need to remove these duplicates. We could remove them at random; however, we might accidentally delete the entry with the highest number of reviews, which more than likely indicates the entry with the higher download count and the more reliable user rating. Let's keep the entry that has the most reviews and delete the other duplicates.

In [10]:
# create a dictionary for storing the max number of reviews for each duplicate
reviews_max = {}

# loop through the google_data_set and don't include the header
for app in google_data_set[1:]:
    # assign the name of the app to a variable
    name = app[0]
    
    # read the number of reviews (kept as a string so it can be matched against the raw rows later)
    n_reviews = app[3]
    
    # if the app is already in the dictionary and its stored number of reviews is lower than this row's,
    # update the entry. Compare as numbers: comparing the raw strings would rank '9' above '10'
    if name in reviews_max and float(reviews_max[name]) < float(n_reviews):
        reviews_max[name] = n_reviews
    
    # if the app is not in the dictionary yet, add it along with its number of reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In [11]:
# print the number of records we have. It should be 9659
print(len(reviews_max))

9659


In [12]:
# Test out one duplicate's max number of reviews to make sure it is correct
print(reviews_max['Quick PDF Scanner + OCR FREE'])

80805


In [13]:
# list for checking the number of reviews for each entry
number_of_reviews = []

# test that I am pulling the max correctly
for app in google_data_set:
    if app[0] == 'Quick PDF Scanner + OCR FREE':
        number_of_reviews.append(app[3])

In [14]:
print(number_of_reviews)

['80805', '80805', '80804']


Now that we have the rows that we want to keep, let's make a clean google_data_set

In [15]:
# List for clean data set
android_clean = []
# List for just storing app names
already_added = []

We'll now loop through the google_data_set and store the row where the number of reviews matches what is in our reviews_max dictionary AND the app is not in our already_added list

In [16]:
# for loop for going through the Google Play store (excluding the header)
for app in google_data_set[1:]:
    # assign the name value to a variable
    name = app[0]
    # assign the number of reviews to a variable
    n_reviews = app[3]
    
    # check to see if the app is NOT in the already_added list and has a number of reviews that matches what is in
    # the reviews_max dictionary. If so, then add the app to the cleaned dataset and the already_added list
    if name not in already_added and n_reviews == reviews_max[name]:
        # add the whole app row to the cleaned dataset
        android_clean.append(app)
        # add the name of the app to the already_added list
        already_added.append(name)

In [17]:
# let's look at the new android_clean dataset:
print(len(android_clean))
print(android_clean[0:10])

9659
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up'], ['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50,000

In [18]:
# let's test and make sure that we have the right records for a sample row
# list for checking the number of reviews for each entry
number_of_reviews = []

# test that I am pulling the max correctly
for app in android_clean:
    if app[0] == 'Quick PDF Scanner + OCR FREE':
        number_of_reviews.append(app[3])

In [19]:
print(number_of_reviews)

['80805']


Now, we have a Google Play store dataset without duplicates or impossible outliers.
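
The two-step approach above (build `reviews_max`, then filter) could also be written as a single pass that stores the best row per app name. This is an alternative sketch, not the method used in this notebook; `dedupe_by_max_reviews` and the sample rows are hypothetical.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row seen with the highest review count.

    Review counts are compared as numbers, not as strings.
    """
    best = {}
    for row in rows:
        name = row[name_idx]
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    # dicts preserve insertion order, so apps come out in first-seen order
    return list(best.values())


# Made-up rows in the same column layout (name at index 0, reviews at index 3).
sample_rows = [
    ['App A', 'BUSINESS', '4.2', '80805'],
    ['App B', 'BUSINESS', '4.1', '9'],
    ['App A', 'BUSINESS', '4.2', '80804'],
    ['App B', 'BUSINESS', '4.1', '10'],
]
deduped = dedupe_by_max_reviews(sample_rows)
for row in deduped:
    print(row)
```

Calling `dedupe_by_max_reviews(google_data_set[1:])` should give the same number of rows as `android_clean`, assuming every Reviews value parses as a number.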

# Data Cleansing Pt. 2 - Removing Non-English apps

Another requirement for our app analysis is to only consider apps directed towards an English-speaking audience, since the company develops apps for this audience. To do this, we'll look at the app names in our dataset and remove any app whose name contains more than three characters outside the ASCII range (codes 0 - 127), which covers the letters and symbols commonly used in English text. Allowing up to three such characters keeps English apps whose names include an emoji or a symbol like ™.

In [20]:
# Function that counts the characters in an app name whose code is greater than 127 (outside the ASCII range)
def is_app_likely_english(app_name):
    
    # initialize a non-English letter counter
    non_english_letters = 0
    
    # look through the app_name one character at a time
    for i in app_name:
        # count the character if its code is outside the ASCII range (0 - 127)
        if ord(i) > 127:
            non_english_letters += 1
        
    # allow up to 3 non-ASCII characters (emoji, ™, etc.); more than that returns False
    return non_english_letters <= 3

In [21]:
# test the function above on a few examples
print(is_app_likely_english('Instagram'))
print(is_app_likely_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_app_likely_english('Docs To Go™ Free Office Suite'))
print(is_app_likely_english('Instachat 😜'))

True
False
True
True
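
To see why a small tolerance is useful, one can list the characters in a name that fall outside the ASCII range. The sketch below uses a hypothetical helper, `non_ascii_chars`, on the same test names as above.

```python
def non_ascii_chars(name):
    """Return the characters of name whose code point is above 127."""
    return [ch for ch in name if ord(ch) > 127]


test_names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
for name in test_names:
    found = non_ascii_chars(name)
    print(len(found), found)
```

The trademark sign and the emoji each count as a single character, so those two names stay under the threshold, while the Chinese title has far more than three.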


Now let's clean both our Google data set and the Apple App Store data set so that we keep only apps whose names have no more than three characters with a code greater than 127.

In [22]:
# go through the Google Play store apps and remove apps that have more than three characters with an ASCII code 
# greater than 127

# create a new list to append the English apps to
android_clean_english_only = []

# iterate through the android_clean list
for app in android_clean:
    
    # assign the name value to a variable
    name = app[0]
    
    # use the function above to determine if the app name uses mainly English letters according to its character codes
    # if so, add the whole app row to the android_clean_english_only list
    if is_app_likely_english(name):
        android_clean_english_only.append(app)

In [23]:
# check the total amount of records and test that a sample of names are coming correctly from the above dataset
print(len(android_clean_english_only))
print(android_clean_english_only[0][0])

9614
Photo Editor & Candy Camera & Grid & ScrapBook


In [24]:
# go through the Apple store apps and remove apps that have more than three characters with an ASCII code 
# greater than 127

# create a new list to append the English apps to
apple_clean_english_only = []

# iterate through the apple_data_set list (excluding the header)
for app in apple_data_set[1:]:
    
    # assign the name value to a variable
    name = app[2]
    
    # use the function above to determine if the app name uses mainly English letters according to its character codes
    # if so, add the whole app row to the apple_clean_english_only list
    if is_app_likely_english(name):
        apple_clean_english_only.append(app)

In [25]:
# check the total amount of records and test that a sample of names are coming correctly from the above dataset
print(len(apple_clean_english_only))
print(apple_clean_english_only[0][2])

6183
PAC-MAN Premium


# Data Cleansing Pt. 3 - Removing non-free-to-download apps

Another requirement for our app analysis is to only consider apps that are free to download, as the company only builds free apps that rely on in-app purchases. To do this, we'll remove any apps that have a price greater than $0.00 from our dataset.
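
The cells below compare the price text with the string `'0'`. A numeric comparison is a bit more forgiving, for instance if a free price were written as `'0.0'`. The sketch below assumes paid prices may carry a leading `$` (an assumption about the raw data, not something shown in the samples above); `price_as_float` and `is_free` are hypothetical helpers.

```python
def price_as_float(price_text):
    """Convert a price string such as '0', '3.99' or '$4.99' to a float.

    Assumes the only non-numeric decoration is surrounding whitespace and a leading '$'.
    """
    return float(price_text.strip().lstrip('$'))


def is_free(price_text):
    """Return True when the price parses to exactly zero."""
    return price_as_float(price_text) == 0.0


for text in ['0', '0.0', '3.99', '$4.99']:
    print(text, is_free(text))
```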

In [26]:
# Let's loop through the Google dataset and check whether any app of type "Free" has a price > $0.00
# NOTE: I know that there is a "Free" type from looking at the sample of the Google Play Store dataset above
def check_type(dataset):
    for i in dataset:
        # float() is only evaluated for "Free" rows because `and` short-circuits
        if i[6] == "Free" and float(i[7]) > 0:
            print("Yeah, don't trust the Type column")
            # stop at the first mismatch so the message below is not printed as well
            return
        
    # only reached when no "Free" app has a price > 0
    print("Actually, you can trust the Type column, but still use the price column.")


check_type(android_clean_english_only)

Actually, you can trust the Type column, but still use the price column.


In [27]:
# Create a list for the Google dataset to only pull those apps that are free
android_clean_english_and_free_only = []
# Now let's loop through the Google dataset to only get those that are free
for app in android_clean_english_only:
    # make a variable for the price
    price = app[7]
    
    # check to see if the price is free. If so, append it to the new list
    if price == '0':
        android_clean_english_and_free_only.append(app)

In [28]:
# Test out to make sure that we are only getting free apps by checking a sample
print(len(android_clean_english_and_free_only))

print(android_clean_english_and_free_only[0:5])

8862
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]


In [29]:
# Create a list for the Apple dataset to only pull those apps that are free
apple_clean_english_and_free_only = []
# Now let's loop through the Apple dataset to only get those that are free
for app in apple_clean_english_only:
    # make a variable for the price
    price = app[5]
    
    # check to see if the price is free. If so, append it to the new list
    if price == '0':
        apple_clean_english_and_free_only.append(app)

In [30]:
# Test out to make sure that we are only getting free apps by checking a sample
print(len(apple_clean_english_and_free_only))

print(apple_clean_english_and_free_only[0:5])

3222
[['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1'], ['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1'], ['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1'], ['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1'], ['7', '283646709', 'PayPal - Send and request money safely', '227795968', 'USD', '0', '119487', '879', '4', '4.5', '6.12.0', '4+', 'Finance', '37', '0', '19', '1']]


# Looking at which Genres Are Most Popular
Since we are defining success as making apps that generate enough revenue in both the Google Play Store and the Apple App Store to turn a profit, we should look at which genres are most popular, since these are likely to bring in the most users, and more users could mean more chances to convert an in-app purchase.

From looking at the Google Play Store dataset, the following columns look like they may be helpful:
- Category
- Genres

From looking at the Apple App Store dataset, the prime_genre column looks like it may be helpful.

In [31]:
# Let's create a frequency table function to see which genres are most popular
# This function takes in a list of lists and an index
def freq_table(dataset, index):
    
    # create a dictionary variable that we will return as part of this function
    freq_dict = {}
    
    # Now create the frequency table based on the index provided above
    # Loop through the dataset
    for i in dataset:
        
        # set the index of the i to a variable
        value = i[index]
        
        # check to see if the index's value from the dataset exists in the freq_dict
        if value in freq_dict:
            # if so, increment the key's value by 1
            freq_dict[value] += 1
        # else just set the key's value to 1
        else:
            freq_dict[value] = 1
            
    # now let's convert each of the key values (number of apps within that genre) to a percentage of total apps
    # use a for loop to iterate through the different keys
    for genre in freq_dict:
        
        # set the key value to a percentage of total apps
        freq_dict[genre] = (freq_dict[genre] / len(dataset)) * 100
    
    # now return the dictionary
    return freq_dict

In [32]:
# Test out the above function on the cleaned Google dataset
google_freq_dict_category = freq_table(android_clean_english_and_free_only,1)
# print out the result
print(google_freq_dict_category)

{'ART_AND_DESIGN': 0.6431956668923494, 'AUTO_AND_VEHICLES': 0.9252990295644324, 'BEAUTY': 0.598059128864816, 'BOOKS_AND_REFERENCE': 2.143985556307831, 'BUSINESS': 4.5926427443015125, 'COMICS': 0.6206273978785828, 'COMMUNICATION': 3.238546603475513, 'DATING': 1.8618821936357481, 'EDUCATION': 1.1735499887158656, 'ENTERTAINMENT': 0.9591514330850823, 'EVENTS': 0.7109004739336493, 'FINANCE': 3.7011961182577298, 'FOOD_AND_DRINK': 1.2412547957571656, 'HEALTH_AND_FITNESS': 3.080568720379147, 'HOUSE_AND_HOME': 0.8237418190024826, 'LIBRARIES_AND_DEMO': 0.9365831640713158, 'LIFESTYLE': 3.9043105393816293, 'GAME': 9.693071541412774, 'FAMILY': 18.934777702550214, 'MEDICAL': 3.5206499661475967, 'SOCIAL': 2.663055743624464, 'SHOPPING': 2.2455427668697814, 'PHOTOGRAPHY': 2.945159106296547, 'SPORTS': 3.39652448657188, 'TRAVEL_AND_LOCAL': 2.335815842924848, 'TOOLS': 8.451816745655607, 'PERSONALIZATION': 3.3175355450236967, 'PRODUCTIVITY': 3.8930264048747465, 'PARENTING': 0.6544798013992327, 'WEATHER': 0

In [33]:
# create the display table function to make viewing the frequency dictionary easier
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
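
For comparison, the standard library's `collections.Counter` can do the counting and the sorting in one place. This is an alternative sketch rather than the notebook's own approach; `freq_table_pct` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter


def freq_table_pct(dataset, index):
    """Return {value: percentage of rows}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields (value, count) pairs sorted by descending count
    return {value: count / total * 100 for value, count in counts.most_common()}


# Made-up rows: app name at index 0, genre at index 1.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Weather'], ['d', 'Games']]
sample_table = freq_table_pct(sample, 1)
for genre, pct in sample_table.items():
    print(genre, ':', pct)
```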

In [34]:
# use the display table function above for the Category column on the Google Play store dataset
display_table(android_clean_english_and_free_only,1)

FAMILY : 18.934777702550214
GAME : 9.693071541412774
TOOLS : 8.451816745655607
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.7941773865944481
MAPS_AND_NAVIGATION : 1.399232678853532
FOOD_AND_DRINK : 1.2412547957571656
EDUCATION : 1.1735499887158656
ENTERTAINMENT : 0.9591514330850823
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8237418190024826
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
PARENTING : 0.6544798013992327
ART_AND_DESIGN : 0.

In [35]:
# use the display table function above for the Genres column on the Google Play store dataset
display_table(android_clean_english_and_free_only,9)

Tools : 8.440532611148726
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5206499661475967
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7490408485669149
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

In [36]:
# use the display table function above for the prime_genre column on the Apple App Store dataset
display_table(apple_clean_english_and_free_only,12)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


# App Store prime_genre Analysis
From looking at the count of free English only apps on the App Store by the column "prime_genre," we see the following:

- "Games" has the most apps at 1874, followed by "Entertainment" at 254
- It looks like the "Games" genre is overwhelmingly the largest category before it significantly drops off at the next category. I would also assume "Games" could fall under "Entertainment" as well
- It appears that most of the apps in this list are geared more towards fun than education (though the two aren't necessarily mutually exclusive)
- Based on the genre counts here, and assuming that a higher count of apps signals a need for a higher supply to meet a high demand for these apps, it looks like making an app that could be categorized as a "Game" would be the way to go. This could make sense as I could see how games and other fun apps are great for killing time*

*Caveat: I am assuming here that more apps in a category equals more popular and in demand which would equal more downloads. However, the larger number of apps in the "Games" category could be an indication of what the developers ultimately like developing rather than what the consumers ultimately want. What would be useful is a number of downloads column, and then a percentage of total downloads which would help us be more confident in knowing which genres are more popular. 
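
The app counts quoted in this section are not printed directly; they can be recovered from the percentage table and the dataset size (3222 free English apps in the App Store data, as printed earlier). A small sketch, with `pct_to_count` as a hypothetical helper:

```python
def pct_to_count(pct, total):
    """Convert a percentage of total back to a whole-number count."""
    return round(pct / 100 * total)


apple_total = 3222  # number of free English App Store apps printed earlier

games_count = pct_to_count(58.16263190564867, apple_total)
entertainment_count = pct_to_count(7.883302296710118, apple_total)
print('Games:', games_count)
print('Entertainment:', entertainment_count)
```

These match the 1874 and 254 figures used in the bullets above.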

# Google Play Store "Genres" Analysis
From looking at the count of free English only apps on the Google Play Store by the column "Genres," we see the following:

- The "Tools" genre is the most common genre at 748, followed by Entertainment at 538
- It appears that the number of apps by genre here is more evenly distributed than in the App Store genres - the difference between the most common genre and the second most common genre on the Google Play Store is much smaller than the difference in the number of apps between the top two genres in the App Store dataset
- In addition, it looks like there are sub-genres for some of the apps where it further breaks down the "Education" genre for instance. To get a better view of the number of apps per genre, it would be a good exercise to recategorize these apps into their parent genre.
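
One way to do that recategorization is to keep only the part of the Genres value before the first semicolon, as in `'Art & Design;Pretend Play'` from the sample rows. The sketch below assumes that the first entry is always the parent genre, which the source does not confirm; `parent_genre` is a hypothetical helper.

```python
def parent_genre(genres_value):
    """Return the text before the first ';' in a Genres value.

    Assumption: the first entry is the parent genre and later entries are sub-genres.
    """
    return genres_value.split(';')[0].strip()


for value in ['Art & Design;Pretend Play', 'Art & Design;Creativity', 'Tools']:
    print(value, '->', parent_genre(value))
```

Applying this to column 9 of each row before building the frequency table would merge the sub-genre rows into their parents.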

# Google Play Store "Category" Analysis
From looking at the count of free English only apps on the Google Play Store by the column "Category," we see the following:

- The "FAMILY" category is the most common category at 1678, followed by GAME at 859 and then TOOLS at 749
- For the most part, except for the FAMILY and GAME category, the category numbers seem to line up with the genres numbers (Category "TOOLS" has 749 where the genre "Tools" has 748, Category "PRODUCTIVITY" and genre "Productivity" both have 345, among other examples)

# Looking at both datasets, and what we should focus our dev on

From looking at the prime_genre column of the App Store dataset and the Genres and Category columns of the Google dataset, it looks like a game meant for entertainment purposes would fit in with the more common free English-only apps in both the App Store and the Google Play Store.

That said, please see my caveat in the Apple App Store dataset analysis above. We are assuming that more apps in a category equals more popular and in demand which would equal more downloads. However, the larger number of apps in the "Games" category could be an indication of what the developers like developing rather than what the consumers actually want. What would be useful is a number of downloads column, and then a percentage of total downloads which would help us be more confident in knowing which genres are more popular. 

# Downloads/Installs Analysis

As mentioned in the caveat above, it would be helpful to see the number of downloads per genre to get a better idea of which genre is more popular. The Google Play Store dataset has an "Installs" column that gives us total downloads, which will be useful for this analysis. The Apple App Store dataset has no such column. However, since it does have a rating count column (rating_count_tot), we can assume that those who rated an app have downloaded it, so we will use the number of user ratings as a proxy for downloads.

# Average Number of Installs (User Ratings) per Apple App Store Genre
As mentioned above, let's calculate the average number of user ratings per app for each Apple App Store "prime_genre" to see which genre has the most users per app on average:

In [37]:
# First, let's use our freq_table function to create a new dictionary variable
# This dictionary variable will have each of the prime_genres as keys
genre_dict = freq_table(apple_clean_english_and_free_only,12)

# From there, let's get the number of user ratings for each genre
# start a for loop going through each genre
for genre in genre_dict:
    
    # initialize a total variable for getting the total user ratings
    total = 0
    
    # initialize a counter for the number of apps in the genre:
    len_genre = 0
    
    # loop through the dataset, and with the genre as the key, get the total number of user ratings
    for app in apple_clean_english_and_free_only:
        
        # get the genre as a variable
        genre_app = app[12]
        
        # get the number of user ratings as a variable
        rating_count_tot_app = float(app[6])
        
        # now check to see if the genre_app = the genre from our unique genre's dictionary
        if genre_app == genre:
            # total user ratings will be incremented by the user ratings for that app
            total = total + rating_count_tot_app
            # total times the app appears will be incremented by 1
            len_genre += 1
            
    # now let's calculate the average user ratings per genre by taking the total user ratings for a genre...
    # ...divided by the total apps within that genre
    average_user_ratings = float(total) / float(len_genre)
    
    # now let's print out the genre and the average user ratings per genre
    print(genre)
    print(average_user_ratings)
    print("")
    

Productivity
21028.410714285714

Weather
52279.892857142855

Shopping
26919.690476190477

Reference
74942.11111111111

Finance
31467.944444444445

Music
57326.530303030304

Utilities
18684.456790123455

Travel
28243.8

Social Networking
71548.34905660378

Sports
23008.898550724636

Health & Fitness
23298.015384615384

Games
22788.6696905016

Food & Drink
33333.92307692308

News
21248.023255813954

Book
39758.5

Photo & Video
28441.54375

Entertainment
14029.830708661417

Business
7491.117647058823

Lifestyle
16485.764705882353

Education
7003.983050847458

Navigation
86090.33333333333

Medical
612.0

Catalogs
4004.0
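
The nested loop above scans the whole dataset once per genre. The same averages can be computed in a single pass and sorted, which makes the ranking easier to read. This is an alternative sketch; `average_by_group` is a hypothetical helper and the sample rows are made up.

```python
def average_by_group(dataset, group_idx, value_idx):
    """Return [(group, average value)] sorted from highest to lowest average."""
    totals = {}
    counts = {}
    for row in dataset:
        key = row[group_idx]
        totals[key] = totals.get(key, 0.0) + float(row[value_idx])
        counts[key] = counts.get(key, 0) + 1
    averages = {key: totals[key] / counts[key] for key in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up rows: genre at index 0, rating count at index 1.
sample = [['Games', '10'], ['Games', '30'], ['Weather', '50']]
sample_averages = average_by_group(sample, 0, 1)
for genre, average in sample_averages:
    print(genre, ':', average)
```

For the Apple data the call would be `average_by_group(apple_clean_english_and_free_only, 12, 6)`.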



# Analysis of Average Installs per Genre (Apple App Store)

From the averages above, Navigation apps have the highest average number of user ratings per app in the App Store. This could mean that Navigation is home to some of the most downloaded apps, but I assume the high rating counts mostly reflect people being more vocal about apps they use consistently. I could see Apple Maps, for instance, not getting that much love. In fact, from checking the App Store now, I see that Apple Maps actually has no user ratings, which I find hard to believe as everyone should have the app on their phone, and I can't be the only one who was taken to the back entrance of where I was supposed to go when I used the app. From looking at Google Maps and Waze, I see 1.5 million reviews for each app, and they're both sitting comfortably at 5 stars. From this, I think the Navigation space would be hard to break into, along with the next most rated genres of Social Networking, Music, and Weather, for similar reasons. The "Reference" genre looks promising at first until you see that the Bible app and Dictionary.com are the Facebook and Google Maps of this genre as well.
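
A genre's average can be dominated by a handful of very large apps, so it is worth listing the most-rated apps within a genre before drawing conclusions. The sketch below is one way to do that; `top_apps_in_genre` is a hypothetical helper whose default indexes follow the Apple dataset layout used in this notebook (name at 2, rating count at 6, genre at 12), and the sample rows are made up.

```python
def top_apps_in_genre(dataset, genre, genre_idx=12, name_idx=2, count_idx=6, limit=5):
    """Return up to `limit` (name, rating count) pairs for a genre, most rated first."""
    rows = [row for row in dataset if row[genre_idx] == genre]
    rows.sort(key=lambda row: float(row[count_idx]), reverse=True)
    return [(row[name_idx], int(float(row[count_idx]))) for row in rows[:limit]]


# Made-up rows with a compact layout: name, genre, rating count.
sample = [
    ['Maps A', 'Navigation', '100'],
    ['Maps B', 'Navigation', '900'],
    ['Game C', 'Games', '50'],
]
top_navigation = top_apps_in_genre(sample, 'Navigation', genre_idx=1, name_idx=0, count_idx=2)
print(top_navigation)
```

On the real data, `top_apps_in_genre(apple_clean_english_and_free_only, 'Navigation')` would show how much of the genre's total comes from its largest apps.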

The following genres have a mid-range number of user ratings (in no particular order):
- Finance
- Travel
- Shopping
- Productivity
- Food & Drink
- Games
- Photo & Video
- News
- Sports

Looking at these genres, I think Finance is dominated by well-established apps owned by established banks. The same goes for Shopping, Food & Drink, News, Sports, and Photo & Video, where well-established companies deliver their content in app form. This brings us full circle to the Games genre: there may be a few heavy hitters like Angry Birds out there, but the space is more open for a small app to catch on. Travel and Productivity could also be looked into further.

# Average Number of Installs per Google Play Store Category
Now, let's calculate the average number of installs per app in the Google Play Store, grouped by "Category", to see which category has the highest average:
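The Installs column stores values as text such as `'10,000+'`, so each value has to be cleaned before it can be averaged. A minimal sketch of that cleaning step follows; the helper name `parse_installs` is hypothetical and not part of the original notebook.

```python
def parse_installs(raw):
    """Convert an install string like '10,000+' to an integer.

    The trailing '+' means "at least this many", so the result is a lower bound.
    """
    return int(raw.replace('+', '').replace(',', ''))


print(parse_installs('10,000+'))   # 10000
print(parse_installs('500,000+'))  # 500000
```

Because the values are open-ended buckets rather than exact counts, any average built on them is an approximation.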

In [41]:
# First, let's use our freq_table function to create a new dictionary variable
# This dictionary variable will have each of the categories as keys
category_dict = freq_table(android_clean_english_and_free_only, 1)

# From there, let's get the average number of installs for each category
# start a for loop going through each category
for category in category_dict:
    
    # initialize a total variable for the total installs in this category
    total = 0
    
    # initialize a counter for the number of apps in this category
    len_category = 0
    
    # loop through the dataset and add up the installs of the apps in this category
    for app in android_clean_english_and_free_only:
        
        # get the app's category as a variable
        category_app = app[1]
        
        # get the app's installs (a string such as '10,000+') as a variable
        installs = app[5]
        
        # remove any '+' sign and any commas
        installs = installs.replace('+', '')
        installs = installs.replace(',', '')
        
        # convert installs to a float
        installs = float(installs)
        
        # now check whether the app's category matches the current category
        if category_app == category:
            # total installs will be incremented by the installs for that app
            total = total + installs
            # the number of apps in this category will be incremented by 1
            len_category += 1
            
    # now let's calculate the average installs per category by taking the total installs for a category...
    # ...divided by the total number of apps within that category
    average_installs = float(total) / float(len_category)
    
    # now let's print out the category and the average installs per category
    print(category)
    print(average_installs)
    print("")
    

ART_AND_DESIGN
1986335.0877192982

AUTO_AND_VEHICLES
647317.8170731707

BEAUTY
513151.88679245283

BOOKS_AND_REFERENCE
8767811.894736841

BUSINESS
1712290.1474201474

COMICS
817657.2727272727

COMMUNICATION
38456119.167247385

DATING
854028.8303030303

EDUCATION
1820673.076923077

ENTERTAINMENT
11640705.88235294

EVENTS
253542.22222222222

FINANCE
1387692.475609756

FOOD_AND_DRINK
1924897.7363636363

HEALTH_AND_FITNESS
4188821.9853479853

HOUSE_AND_HOME
1331540.5616438356

LIBRARIES_AND_DEMO
638503.734939759

LIFESTYLE
1437816.2687861272

GAME
15560965.599534342

FAMILY
3694276.334922527

MEDICAL
120616.48717948717

SOCIAL
23253652.127118643

SHOPPING
7036877.311557789

PHOTOGRAPHY
17805627.643678162

SPORTS
3638640.1428571427

TRAVEL_AND_LOCAL
13984077.710144928

TOOLS
10682301.033377837

PERSONALIZATION
5201482.6122448975

PRODUCTIVITY
16787331.344927534

PARENTING
542603.6206896552

WEATHER
5074486.197183099

VIDEO_PLAYERS
24727872.452830188

NEWS_AND_MAGAZINES
9549178.467741935

MA

# Analysis of Average Installs per Category (Google Play Store)

Keeping in mind what does well in the Apple App Store, the following Google Play categories have some of the highest average installs (in no particular order):

- TRAVEL_AND_LOCAL
- PRODUCTIVITY
- GAME
- PHOTOGRAPHY
- VIDEO_PLAYERS
- NEWS_AND_MAGAZINES
- COMMUNICATION

Looking at these categories, we see that the GAME category is heavily downloaded on the Google Play Store as well.
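Reading the top categories off an unsorted printout is error-prone, so it can help to rank the averages before comparing them. The sketch below computes the averages in a single pass and sorts them in descending order. The function name `average_installs_by_category` and the sample rows are assumptions for illustration; the column positions (1 for Category, 5 for Installs) follow the dataset header shown earlier.

```python
def average_installs_by_category(rows, category_idx=1, installs_idx=5):
    """Return (category, average installs) pairs, highest average first."""
    totals = {}
    counts = {}
    for row in rows:
        category = row[category_idx]
        # strip '+' and ',' from strings such as '10,000+' before converting
        installs = int(row[installs_idx].replace('+', '').replace(',', ''))
        totals[category] = totals.get(category, 0) + installs
        counts[category] = counts.get(category, 0) + 1

    averages = {category: totals[category] / counts[category] for category in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Made-up rows with only the Category and Installs columns filled in
sample_rows = [
    ['App A', 'GAME', '', '', '', '1,000+'],
    ['App B', 'GAME', '', '', '', '3,000+'],
    ['App C', 'TOOLS', '', '', '', '500+'],
]

ranked = average_installs_by_category(sample_rows)
for category, average in ranked:
    print(category, average)
```

Applied to the cleaned Google Play data, the same function would list the categories from most to least installed on average, which makes a shortlist like the one above easier to verify.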

# Conclusion
When looking through our datasets to decide which genre of free, English-language apps we should develop, it's tempting to just build whatever has the most downloads or the most reviews. However, as we saw with the Social Networking and Navigation genres, a category's numbers can be driven by one or two apps that are not going anywhere anytime soon. As a result, it would be better to enter a popular category where it's possible to make a variety of apps that each have a chance of catching an audience. We narrowed our categories down to the ones that meet these criteria and, taking personal interest and experience into account, settled on Games: it would be easier to churn out different games that help people pass the time, and our developers might have some fun making them.