# App Analysis Project

This project analyzes data on Android and Apple apps that are available on Google Play and the App Store. The analysis assumes that app developers will offer their apps for free and earn revenue by selling ads. Therefore, I want to find which types of apps are likely to attract the most users across both the Apple and Android platforms.

## Opening and Exploring Data

First, I will open the data sets 'AppleStore.csv' and 'googleplaystore.csv', which were obtained from Kaggle, convert them to a list of lists, and perform some exploratory data analysis.

The link to the Apple data set documentation is here: [link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

The link to the Android data set documentation is here: [link](https://www.kaggle.com/lava18/google-play-store-apps).

In [1]:
from csv import reader

# Read each CSV into a list of lists; the with-blocks close the files afterwards.
with open('AppleStore.csv', encoding='utf8') as open_file_apple:
    apple_data = list(reader(open_file_apple))

with open('googleplaystore.csv', encoding='utf8') as open_file_android:
    android_data = list(reader(open_file_android))

I have written a function that takes a dataset and starting and ending indices as inputs and prints each row in that slice, separated by blank lines. If the fourth parameter is set to True, it also prints the number of rows in the dataset (the length of the list of lists) and the number of columns (the length of the first row).

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(apple_data, 0, 3)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']




In [4]:
explore_data(android_data, 0, 3)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




Based on this initial look, I can see that both datasets have header rows. To keep the headers from being counted when checking how many apps each dataset contains, I will only look at the second row (index number 1) onward.

In [5]:
explore_data(apple_data[1:], 0, 1, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7197
Number of columns: 16


In [6]:
explore_data(android_data[1:], 0, 1, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


I learned that the Apple data set has data on 7,197 apps with 16 explanatory variables and that the Android data set has data on 10,841 apps with 13 explanatory variables.

## Variable Selection

I am now going to look at each column of the data set by looking at the header row to determine which columns would be useful for this analysis. 

In [7]:
apple_data[0]

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [8]:
android_data[0]

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

The columns for the Apple data set are: App ID, App Name, Size, Currency Type, Price, User Rating Counts (all versions), User Rating Counts (current version), Average User Rating (all versions), Average User Rating (current version), Latest Version, Content Rating, Primary Genre, Number of Supporting Devices, Number of Screenshots, Number of Supported Languages, and Vpp Device Based Licensing Enabled.

The columns for the Android data set are: App Name, Category, Overall User Rating, Number of Reviews, Size, Number of Installs, Type (paid or free), Price, Content Rating, Genres, Last Updated, Current Version, and Android Version. 

For both data sets, I think the most applicable columns are the app name, the price (since I only want to analyze free apps), and the language (since I am only interested in developing apps in the English language). In order to determine which apps have the most users, I could look at the number of installs or the number of user ratings. Then, based on that information, I can see what genres are correlated with free, English apps that have a lot of installs or ratings. 
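Since the later cells refer to columns by index number, a small lookup can make those numbers easier to check. The following is a minimal sketch; `column_indices` is a helper name I am introducing for illustration, and the header lists are copied from the output above.

```python
# Header rows copied from the two data sets shown above.
apple_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']

def column_indices(header, names):
    """Return a dict mapping each requested column name to its index (hypothetical helper)."""
    return {name: header.index(name) for name in names}

apple_cols = column_indices(
    apple_header, ['track_name', 'price', 'rating_count_tot', 'prime_genre'])
android_cols = column_indices(
    android_header, ['App', 'Price', 'Installs', 'Category', 'Genres'])

print(apple_cols)
print(android_cols)
```

The resulting indices (for example, 11 for `prime_genre` and 5 for `Installs`) match the ones used in the cells later in this notebook.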

## Data Cleaning

First, I will delete any erroneous entries that other users have pointed out. The Google Play dataset documentation notes that the "Category" value is missing for row 10,472 (row 10,473 if you count the header row), so I will check that the error exists and delete that row if it does.

In [9]:
print(android_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


There are only 12 entries in this row, when there should be 13 explanatory variables in each row, so I will delete this row.
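The same length comparison could be applied to every row rather than one known row. This is a rough sketch on toy data (the rows below are made up, not taken from the files), and `find_bad_rows` is a hypothetical helper name.

```python
def find_bad_rows(dataset):
    """Return (index, row_length) pairs for rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data (not from the real files): a header, one complete row, one short row.
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]

print(find_bad_rows(sample))
```

Run against `android_data` before any deletion, a check like this would be expected to flag index 10473, since that row has 12 entries instead of 13.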

In [10]:
del android_data[10473]

No errors were pointed out in the App Store dataset documentation, so I don't have anything to fix yet.

I will also check whether the data sets have any duplicate entries, starting with the Android data set. I will do this by looping over the dataset and for each row, adding the name to a blank list. If the name already exists in that list, then I will add it to a separate list of duplicate entries instead of adding it to the other list.

In [11]:
unique_apps_android = []
duplicate_apps_android = []
for row in android_data:
    name = row[0]
    if name in unique_apps_android:
        duplicate_apps_android.append(name)
    else:
        unique_apps_android.append(name)

print(duplicate_apps_android[0:3])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business']


In [12]:
print(len(duplicate_apps_android))

1181


The Android data set definitely has duplicate entries (1181 of them).
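Checking `name in unique_apps_android` scans a list on every iteration, which gets slow as the list grows. A possible variant, sketched here on made-up names, tracks seen names in a set instead; `find_duplicates` is a name I am introducing for illustration.

```python
def find_duplicates(names):
    """Return every repeated occurrence of a name, in the order encountered."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)
        else:
            seen.add(name)
    return duplicates

# Toy input (not the real data set).
sample_names = ['Box', 'Slack', 'Box', 'Instagram', 'Box', 'Slack']
print(find_duplicates(sample_names))
```

This should give the same duplicate list as the loop above, because only the container used for membership tests changes.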

Rather than removing duplicates at random, I will keep the entry with the highest number of reviews, which I am treating as the most recent one, and delete the duplicate entries with fewer reviews. To do this, I will create a blank dictionary, then loop through the data set and add each app name and its number of reviews to the dictionary. If I come across an app name that has already been added to the dictionary, and this row has more reviews than the existing dictionary entry, I will update the dictionary to use this number of reviews instead.

At the end, I verify that the number of entries in the dictionary equals the original 10,841 entries - 1 deleted entry - 1,181 duplicate entries = 9,659.

In [13]:
reviews_max = {}

for row in android_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))

9659


Now that I've created a dictionary that has one entry for each app name and the highest number of reviews it received, I need to update the data set to keep only these rows and delete the duplicate rows. For each row in the original data set (except the header row), I will append that row to a new list of lists if its number of reviews matches the entry for that app name in the dictionary. This ensures that only the rows corresponding to a unique app name with the highest number of reviews are added to this cleaned data set.

In case there are multiple entries with the same number of reviews, I also need to keep a list of all the app names that have already been added to this "cleaned" data set and skip any app whose name is already in that list.
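The dictionary step and the filtering step could also be combined into one pass by storing the whole row instead of only the review count. This is only a sketch under that assumption, with made-up rows and a hypothetical function name `keep_most_reviewed`.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row seen with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows (values are illustrative, not from the real data set).
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '159130'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
]

for row in keep_most_reviewed(sample_rows):
    print(row)
```

One difference to be aware of: the output follows the order in which each app name first appears, whereas the two-step approach keeps each row at the position of its highest-review entry.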

In [14]:
android_clean = []
already_added = []

for row in android_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        
print(len(android_clean))

9659


I will now check whether there are duplicate entries in the App Store dataset. I will follow the same process as above for the Google Play dataset, except in this case, I will check the App ID column instead of the App Name column for repetition.

In [15]:
unique_apps_apple = []
duplicate_apps_apple = []
for row in apple_data:
    app_id = row[0]  # named app_id to avoid shadowing the built-in id()
    if app_id in unique_apps_apple:
        duplicate_apps_apple.append(app_id)
    else:
        unique_apps_apple.append(app_id)

print(duplicate_apps_apple[0:3])

[]


This is an empty list, so there are no duplicate app entries in the App Store dataset.

Next, I will remove any apps that are not in the English language. I will do this by creating a function that returns True if an app name contains only characters commonly used in English text and False if it does not. To do this, I will loop over each character in the app name. If any character has a code point greater than 127, it falls outside the ASCII range and is not commonly used in English text, so False is returned. Otherwise, if all of the characters have code points between 0 and 127, True is returned.

In [16]:
def english_char(app_name):
    for character in app_name:
        if ord(character) > 127:
            return False
    return True

print(english_char('Instagram'))
print(english_char('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_char('Docs To Go™ Free Office Suite'))
print(english_char('Instachat 😜'))

True
False
False
False


Testing this on a few different potential app names shows that the function does not work properly on the last two examples because they contain special characters like a trademark symbol or emoji. I will therefore modify the function to only return False (i.e. the app name is non-English) if more than 3 characters have a code point greater than 127. This isn't a perfect solution, but it at least prevents an otherwise English app name with up to three emojis or other special characters from being mistakenly deleted.

I will do this by initializing a count variable that is incremented by 1 for each character in an app name that has a code point greater than 127. If this count is greater than three for a given app name, the function will return False; otherwise, it will return True.

In [17]:
def english_char(app_name):
    num_non_english = 0
    for character in app_name:
        if ord(character) > 127:
            num_non_english += 1
    if num_non_english > 3:
        return False
    return True

print(english_char('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_char('Docs To Go™ Free Office Suite'))
print(english_char('Instachat 😜'))

False
True
True


We can see that the function identifies the last two examples as English names now. 
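The cutoff of three is a judgment call, so it may be convenient to expose it as a parameter. Here is a minimal sketch of that idea; `is_english` and `max_non_ascii` are names I am introducing, not part of the cells above.

```python
def is_english(app_name, max_non_ascii=3):
    """Return True if the name has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for character in app_name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Docs To Go™ Free Office Suite'))                   # default cutoff
print(is_english('Docs To Go™ Free Office Suite', max_non_ascii=0))  # strict ASCII only
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

With the default cutoff this behaves like the second version of `english_char`, and with `max_non_ascii=0` it behaves like the first.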

I will now filter out all non-English apps from both data sets using this function. I will create two new blank lists. I will then loop through each row in the data set (using the cleaned data set without duplicate entries for the Google Play dataset), and if the function returns True on the app name being English, I will add it to the blank list. This means only the apps with English names will be in these new lists.

In [18]:
android_data_english = []
apple_data_english = []
for row in android_clean:
    name = row[0]
    if english_char(name):
        android_data_english.append(row)
for row in apple_data[1:]:
    name = row[1]
    if english_char(name):
        apple_data_english.append(row)
        
print(len(android_data_english))
print(len(apple_data_english))

9614
6183


There are now 9,614 apps in the Google Play data set that are not duplicates and have English names. There are 6,183 apps in the Apple data set that are not duplicates and have English names.

As the final step in the data cleaning process, I am going to remove the non-free apps, since I am only interested in analyzing free apps. I will do this by creating two blank lists again. I will loop through each row in the dataset, identify the price, and only append that row to the new list if it has a price equal to 0. 

At first, I received an error on the Google Play dataset; the error stated that the price $4.99 could not be converted to a float. Looking at the dataset documentation on Kaggle, I noticed that all of the non-zero prices had a dollar sign in front of them and thus could not be converted to a float, so I instead left the prices as a string and added the prices with a value of '0' (string) to the blank list. The prices in the App Store dataset did not have dollar signs in front of them and were thus able to be converted to a float without issue.

In [19]:
android_data_free = []
apple_data_free = []

for row in android_data_english:
    price = row[7]
    if price == '0':
        android_data_free.append(row)
for row in apple_data_english:
    price = float(row[4])
    if price == 0:
        apple_data_free.append(row)
        
print(len(android_data_free))
print(len(apple_data_free))
        

8864
3222


The result is a total of 8,864 free, English apps in the Google Play dataset and 3,222 free, English apps in the App Store data set. This is a good place to stop data cleaning for the purposes of this project and move on to data analysis.
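As an alternative to comparing the Google Play price against the string '0', the dollar sign could be stripped before converting to a float. This sketch assumes every price is either '0' or a dollar amount such as '$4.99'; `parse_price` is a hypothetical helper.

```python
def parse_price(price_text):
    """Convert a price string such as '0' or '$4.99' to a float."""
    return float(price_text.lstrip('$'))

print(parse_price('0'))
print(parse_price('$4.99'))
print(parse_price('$4.99') == 0)
```

With a helper like this, both data sets could use the same numeric `price == 0` test, at the cost of raising a `ValueError` on any price that does not fit the assumed format.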

## Data Analysis

Now that I have cleaned the data, I will begin the analysis by looking at common genres of app. I will look at common genres in both the Google Play dataset and the App Store dataset based on a planned development strategy of developing a low cost app for Google Play, then if it gets a good response from users, developing it for the App Store. The goal is to find out what genres of app are most popular in order to focus on developing that type of app.

Based on the available data, I will look at the Primary Genre column of the App Store data set (index number 11) and both the Category and Genres columns (index numbers 1 and 9, respectively) of the Google Play dataset.

To perform this analysis, I will build functions that allow me to pass in a dataset and index number for a column and return the percentage of apps that have each possible value in that column, sorted largest to smallest.

First, I will make a function that creates a frequency table of percentages. The function takes a dataset and index number as inputs and creates a blank dictionary. It then loops over each row in the dataset and finds the value at the provided index number, which corresponds to a particular column in the dataset. The function creates a new entry in the dictionary for each new value it encounters in that column and gives it an initial count of 1. If the column value it encounters already exists as a key in the dictionary, it instead increments the count by 1. This creates a dictionary of column values and the associated frequency that those values were observed. Finally, the function counts the number of rows in the dataset and divides each frequency by that total, then multiplies by 100 to come up with a percentage.

In [20]:
def freq_table(dataset, index):
    table = {}
    tot_num_apps = len(dataset)
    for row in dataset:
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    for key in table:
        table[key] /= tot_num_apps
        table[key] *= 100
    return table

Next, I will build a function that sorts the frequency table values in descending order to make the table easier to read. It takes a dataset and index number as input, then creates a frequency table on that column using the function created above. It then loops over the frequency table, converting each key-value pair to a tuple with the value first and the key second, and appends those tuples to a blank list. It sorts that list in reverse (descending) order based on the values, which are the frequency percentages, and then prints out all of the keys and values.
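For comparison, the standard library's `collections.Counter` can do the counting and the sorting in a couple of lines. This is a sketch on toy rows, not a replacement for the functions in this notebook; `freq_table_sorted` is a name I am making up.

```python
from collections import Counter

def freq_table_sorted(dataset, index):
    """Return (value, percentage) pairs for one column, most common first."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Toy rows (not from the real data sets).
sample = [
    ['App A', 'GAME'],
    ['App B', 'TOOLS'],
    ['App C', 'GAME'],
    ['App D', 'FAMILY'],
]

for value, pct in freq_table_sorted(sample, 1):
    print(value, ':', pct)
```

Values with equal counts come out in the order they were first seen, whereas sorting (percentage, key) tuples in reverse breaks ties by the key in descending order.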

In [21]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        
display_table(android_data_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [22]:
display_table(android_data_free, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

In [23]:
display_table(apple_data_free, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


After running this function on the three genre-related columns I want to analyze, it is clear that the most common genre in the App Store dataset is Games (58%), with the second most common being Entertainment at a much lower frequency (8%). The four most common genres account for 75% of the total apps in the dataset. Overall, the most common apps seem to be for entertainment purposes rather than practical purposes.

The Google Play dataset does not have as clear a front-runner. The most common category is Family (19%), followed by Game (10%) and Tools (8%). The most common genres are Tools (8%) and Entertainment (6%). So many genres each account for a small percentage of the total number of apps that the Genres column is not very useful for analysis. This dataset seems to be more evenly split between entertainment apps and practical apps, with Tools ranking third in category and first in genre, as opposed to the App Store dataset, which leaned heavily towards entertainment apps.

This gives me a good idea of how many apps of each genre exist in both datasets. I cannot recommend a genre based on this frequency table, however, since it is possible that even though certain genres have a lot of apps in both datasets, those apps do not have many installs or ratings (and thus, not many users, which is what we care about).

In order to fix this problem, I will find out which genres have the most users. For the App Store data set, the best variable to use for this is the number of user ratings. While this is not exactly the same as looking at the total number of users, since not all users will leave ratings, it is the closest explanatory variable that I have to look at.

I will first save the frequency table that I generated above, which was in the form of a dictionary, to a variable. I will then loop over each genre in the frequency table, and within that loop, I will loop over each row of the data set. If the genre in that row is the same as the current genre from the dictionary, I will add the total number of ratings for that app to a running total and increment the number of apps in that genre by 1. Once I have looped through all of the rows of the data set and found all of the apps that have the genre I am looking for, I will divide the total number of ratings for that genre by the number of apps in that genre to get the average number of ratings per app. I will repeat this process for every genre in the dictionary.

In [24]:
apple_genres = freq_table(apple_data_free, 11)

for genre in apple_genres:
    total = 0
    len_genre = 0
    for row in apple_data_free:
        genre_app = row[11]
        if genre_app == genre:
            ratings = float(row[5])
            total += ratings
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre, ':', avg_ratings)

Medical : 612.0
Music : 57326.530303030304
Shopping : 26919.690476190477
News : 21248.023255813954
Business : 7491.117647058823
Weather : 52279.892857142855
Health & Fitness : 23298.015384615384
Utilities : 18684.456790123455
Entertainment : 14029.830708661417
Reference : 74942.11111111111
Finance : 31467.944444444445
Book : 39758.5
Food & Drink : 33333.92307692308
Catalogs : 4004.0
Games : 22788.6696905016
Social Networking : 71548.34905660378
Productivity : 21028.410714285714
Photo & Video : 28441.54375
Education : 7003.983050847458
Navigation : 86090.33333333333
Sports : 23008.898550724636
Travel : 28243.8
Lifestyle : 16485.764705882353


The result of this analysis shows that the Games genre, which had by far the largest number of apps in the data set, actually does not have the highest average number of ratings. It has an average of 23K ratings per app, compared to Navigation, which has the highest average number of ratings at 86K per app. The lowest average number of ratings for a genre is Medical at 612 per app.
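The nested loop above rescans the whole data set once per genre. The same averages could be accumulated in a single pass with two dictionaries, roughly as sketched below on toy rows; `average_by_group` is a hypothetical helper name.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: mean of the numeric column} computed in one pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Toy rows (not from the real data set): name, genre, number of ratings.
sample = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Navigation', '500'],
]

print(average_by_group(sample, 1, 2))
```

For the App Store data this would be called with the genre index 11 and the rating count index 5.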

Based on the fact that the majority of apps in the App Store are entertainment apps as opposed to practical apps, I would recommend a Social Networking app, since that genre has a high average number of ratings (about 72K on average) and is also an entertainment app. 

I am now going to repeat this process on the Google Play dataset but look at the total number of installs instead of the number of user ratings. The number of installs is probably a better indicator of how many users the app actually has, but the App Store dataset did not have this variable. I am going to use the Category column to determine the genre rather than the Genres column, which did not give results that were as useful when I looked at the total number of apps.

One caveat with this analysis is that, according to the Kaggle documentation, the number of installs in the Google Play dataset is given in increments such as 100,000+ or 1,000,000+ rather than as an exact number. I will not be able to calculate an average number of installs for each genre of app unless the number of installs can be converted to a float or integer type. Therefore, I will replace any + signs or commas with an empty string, thereby deleting them. This means that the number of installs will be assumed to be the lowest in that range (e.g. 100,000+ installs is assumed to be 100,000 installs).
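Because this conversion is needed in more than one cell, it could be wrapped in a small helper. The sketch below assumes every value looks like '100,000+' or a plain number; `parse_installs` is a name I am introducing for illustration.

```python
def parse_installs(installs_text):
    """Convert an install bucket such as '100,000+' to its lower bound as a float."""
    return float(installs_text.replace('+', '').replace(',', ''))

print(parse_installs('100,000+'))
print(parse_installs('1,000,000+'))
print(parse_installs('0'))
```

The results are lower bounds, so any average built from them understates the true number of installs.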

In [25]:
android_genres = freq_table(android_data_free, 1)

for genre in android_genres:
    total = 0
    len_genre = 0
    for row in android_data_free:
        genre_app = row[1]
        if genre_app == genre:
            installs = row[5]
            installs = installs.replace('+', '')
            installs = float(installs.replace(',', ''))
            total += installs
            len_genre += 1
    avg_installs = total / len_genre
    print(genre, ':', avg_installs)

PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
MAPS_AND_NAVIGATION : 4056941.7741935486
MEDICAL : 120550.61980830671
TOOLS : 10801391.298666667
COMMUNICATION : 38456119.167247385
TRAVEL_AND_LOCAL : 13984077.710144928
ART_AND_DESIGN : 1986335.0877192982
SHOPPING : 7036877.311557789
EDUCATION : 1833495.145631068
PERSONALIZATION : 5201482.6122448975
AUTO_AND_VEHICLES : 647317.8170731707
FOOD_AND_DRINK : 1924897.7363636363
SPORTS : 3638640.1428571427
ENTERTAINMENT : 11640705.88235294
HOUSE_AND_HOME : 1331540.5616438356
PHOTOGRAPHY : 17840110.40229885
DATING : 854028.8303030303
COMICS : 817657.2727272727
GAME : 15588015.603248259
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
PRODUCTIVITY : 16787331.344927534
NEWS_AND_MAGAZINES : 9549178.467741935
LIFESTYLE : 1437816.2687861272
VIDEO_PLAYERS : 24727872.452830188
LIBRARIES_AND_DEMO : 638503.734939759
HEALTH_AND_FITNESS : 4188821.9853479853
BOOKS_AND_REFERENCE : 8767811.894736841
FAMILY : 3695641.8198090694
BUSINESS : 1712

Looking at the results, I can see that Social apps have an average of over 23M installs per app, which is one of the highest averages for any category (after Communication and Video Players). Also, there was no genre called Communication in the App Store data set, so apps that fell into that category might have just been labeled as Social Networking apps. A quick scan of both data sets appears to confirm this: some apps like WhatsApp and Skype are labeled as Communication in the Google Play data set but Social Networking in the App Store data set.

Therefore, I think that a Social Networking/Communication app is the best recommendation of a free, English language app to develop if your goal is to generate revenue through ads. I am going to do a little more analysis to see what specific area of Social Networking/Communication appears to have the best market opportunity.

To do this, I will create two new lists that only contain the Social Networking/Communication apps from the App Store and Google Play datasets and sort them from most installs or ratings to least installs or ratings. I will initialize two blank lists and then loop over each row in the data set, identifying the genre column, name column, and installs or ratings column. I will ensure that I convert the installs or ratings column to a float type so that it will sort correctly later on. I will then use an if statement to only add the apps with the genres I am looking for to the new lists, but only after first converting the data to a tuple with the number of installs or ratings first and the name of the app second. This will allow me to sort the data. 

When I am done creating the two new lists, I will sort them in descending order based on the number of installs or ratings. I will then see how many apps are in each list and look at the 10 apps with the highest number of installs or ratings in each list.

In [33]:
android_data_social = []
apple_data_social = []

for row in android_data_free:
    genre = row[1]
    installs = row[5]
    installs = installs.replace('+', '')
    installs = float(installs.replace(',', ''))
    name = row[0]
    if genre == "SOCIAL" or genre == "COMMUNICATION":
        entry = (installs, name)  # avoid shadowing the built-in tuple
        android_data_social.append(entry)
        
for row in apple_data_free:
    genre = row[11]
    ratings = float(row[5])
    name = row[1]
    if genre == "Social Networking":
        entry = (ratings, name)  # avoid shadowing the built-in tuple
        apple_data_social.append(entry)

android_social_sorted = sorted(android_data_social, reverse = True)
apple_social_sorted = sorted(apple_data_social, reverse = True)

print(len(android_data_social))
print(len(apple_data_social))
print(android_social_sorted[0:10])
print(apple_social_sorted[0:10])

523
106
[(1000000000.0, 'WhatsApp Messenger'), (1000000000.0, 'Skype - free IM & video calls'), (1000000000.0, 'Messenger – Text and Video Chat for Free'), (1000000000.0, 'Instagram'), (1000000000.0, 'Hangouts'), (1000000000.0, 'Google+'), (1000000000.0, 'Google Chrome: Fast & Secure'), (1000000000.0, 'Gmail'), (1000000000.0, 'Facebook'), (500000000.0, 'imo free video calls and chat')]
[(2974676.0, 'Facebook'), (1061624.0, 'Pinterest'), (373519.0, 'Skype for iPhone'), (351466.0, 'Messenger'), (334293.0, 'Tumblr'), (287589.0, 'WhatsApp Messenger'), (260965.0, 'Kik'), (177501.0, 'ooVoo – Free Video Call, Text and Voice'), (164963.0, 'TextNow - Unlimited Text + Calls'), (164249.0, 'Viber Messenger – Text & Call')]


This shows that there are a total of 523 Social or Communication apps in the Google Play data set and a total of 106 Social Networking apps in the App Store data set. Looking at the apps that have the most installs in the Google Play data set, I can see that some are in the 1B+ installs bracket, like WhatsApp, Facebook, and Instagram. Looking at the apps that have the most ratings in the App Store data set, I see many of the same apps, but also some that I have never heard of before, like TextNow and Kik.

In [35]:
print(android_social_sorted[100:110])
print(apple_social_sorted[30:40])

[(10000000.0, 'Hangouts Dialer - Call Phones'), (10000000.0, 'HTC Social Plugin - Facebook'), (10000000.0, 'GroupMe'), (10000000.0, 'Grindr - Gay chat'), (10000000.0, 'Google Voice'), (10000000.0, 'Google Allo'), (10000000.0, 'Glide - Video Chat Messenger'), (10000000.0, 'GO Notifier'), (10000000.0, 'GMX Mail'), (10000000.0, 'Free phone calls, free texting SMS on free number')]
[(23530.0, 'SimSimi'), (23201.0, 'Grindr - Gay and same sex guys chat, meet and date'), (20649.0, 'Wishbone - Compare Anything'), (18841.0, 'imo video calls and chat'), (18482.0, 'After School - Funny Anonymous School News'), (17694.0, 'Quick Reposter - Repost, Regram and Reshare Photos'), (16772.0, 'Weibo HD'), (15185.0, 'Repost for Instagram'), (14724.0, 'Live.me – Live Video Chat & Make Friends Nearby'), (14402.0, 'Nextdoor')]


Looking further down in the datasets, I can see that there are a lot of different apps in this category that are not major players like Facebook or Instagram but still generate a lot of installs or ratings. This supports the idea that this is a good potential market opportunity. One option is to create a niche Social Networking app like Grindr that appeals to a certain subset of people. Another option is to create a chat/messenger app that has functionality beyond what the main existing apps like Hangouts, WhatsApp, and Messenger offer.

## Conclusion

Overall, the data sets appear to show that a good potential market opportunity for a free app with ad-generated revenue is in the Social Networking or Communication areas. These apps can attract a large number of users, even though there is already a large market presence, because there are so many different niches to explore. One option is to create a social networking app that is geared towards a certain subset of people (for instance, football fans) and provides content and interaction with other fans that they cannot get on any other platform. There are numerous hobbies, groups of people with medical conditions, and other shared areas of interest that do not yet have a social platform to connect those people.