# App Profiles in the App Store

This notebook demonstrates basic Python programming skills through an analysis of two datasets. It should show my ability to do the following:
* Open and read a .csv file
* Utilize for loops and conditional statements
* Create and edit lists and dictionaries
* Write functions
* Maintain a Jupyter Notebook with commented code and easy-to-read mark-up

This project uses two app store datasets, one from the Google Play Store and one from the Apple App Store, to help developers understand what sorts of apps are likely to attract users. These datasets contain information about the prices, ratings, genres, and sizes of the apps. For the purposes of this project, we're assuming that our audience is a company that creates free-to-download apps in English.  
* For the Apple Store dataset, see the documentation [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).
* For the Google Play Store dataset, see the documentation [here](https://www.kaggle.com/lava18/google-play-store-apps).

## Opening & Reading a .csv

In [1]:
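AppleStore = open('AppleStore.csv', encoding='utf8')
# opens the Apple Store file; the explicit encoding keeps non-ASCII app names from being decoded differently on different platforms

In [2]:
googleplaystore = open('googleplaystore.csv', encoding='utf8')
# opens the Google Play Store file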

In [3]:
from csv import reader
# imports a function for reading files from the Python csv library

In [4]:
apple_apps = list(reader(AppleStore))
google_apps = list(reader(googleplaystore))
#reads both files in as lists of lists
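
The cells above leave both files open. As a minimal sketch of an alternative, the helper below reads a file inside a `with` block so that it is closed automatically. The name `load_csv` and its return shape are my own assumptions for illustration, not part of the original notebook.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Hypothetical helper: return (header, rows) for the CSV file at `path`."""
    # The with-block closes the file even if reading raises an error.
    # newline='' is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    # The first row is treated as the header, the rest as data rows.
    return rows[0], rows[1:]
```

With the files from this project, a call would look like `apple_header, apple_rows = load_csv('AppleStore.csv')`.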

In [5]:
def explore_data(dataset, start, end, rows_and_columns=True, header=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of columns:', len(dataset[0]))
        if header:
            print('Number of rows:', len(dataset[1:]))
        else:
            print('Number of rows:', len(dataset))
#creates a function that prints the row contents for a slice of the dataset, as well as the number of rows and columns if rows_and_columns = True

In [6]:
explore_data(apple_apps,1,4, header=True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of columns: 16
Number of rows: 7197


In [7]:
explore_data(google_apps,1,4, header=True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of columns: 13
Number of rows: 10841


In [8]:
print('Apple Dataset Columns:', apple_apps[0])
print('\n')
print('Google Dataset Columns:', google_apps[0])
#prints the header rows for both datasets

Apple Dataset Columns: ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Google Dataset Columns: ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


## Cleaning Up the Data

The audience for this project is only interested in free-to-download, English-language apps, so the first task is to filter out irrelevant apps.<br> <br>Discussions of the Google data have reported an error in one particular row. The first data-cleaning step is to confirm that this row is missing information and then remove it.

In [9]:
print(google_apps[10473])
#prints the row to confirm that it is missing a value (it has 12 fields instead of 13)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [10]:
del google_apps[10473]
#removes problem row
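
The row index above came from the dataset's discussion page. A length check is one way to locate such rows without knowing the index in advance. The sketch below is a hedged illustration on a tiny made-up dataset; `find_malformed_rows` is a hypothetical helper, not a function from this notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Tiny made-up dataset: the last entry is missing its category.
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]
bad_rows = find_malformed_rows(sample)
print(bad_rows)  # [(2, ['Broken App', '1.9'])]
```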

There are also several duplicate rows in both datasets. Removing them will be the next step in cleaning.

In [11]:
def find_dups(dataset, column_index, header=True):
    seen = set()
    dup_values = []
    if header:
        dataset = dataset[1:]
    for row in dataset:
        cell = row[column_index]
        if cell in seen:
            dup_values.append(cell)
        else:
            seen.add(cell)
    return dup_values
#creates a function that returns the repeated values in a column, with one entry for each extra occurrence

In [12]:
apple_dups = find_dups(apple_apps, 1)
google_dups = find_dups(google_apps, 0)
print('Apple duplicates:', apple_dups)
print('\n')
print('Google duplicates:', google_dups)
# assigns a list of duplicates for each dataset to its respective variable, then prints these lists

Apple duplicates: ['Mannequin Challenge', 'VR Roller Coaster']


Google duplicates: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express', 'Accounting App - Zoho Books', 'Invoice & Time Tracking - Zoho', 'join.me - Simple Meetings', 'Invoice 2go — Professional Invoices and Estimates', 'SignEasy | Sign and Fill PDF and other Documents', 'Quick PDF Scanner + OCR FREE', 'Genius Scan - PDF Scanner', 'Tiny Scanner - PDF Scanner App', 'Fast Scanner : Free PDF Scan', 'Mobile Doc Scanner (MDScan) Lite', 'TurboScan: scan documents and receipts in PDF', 'Tiny Scanner Pro: PDF D

In [13]:
print('# of Apple duplicates:', len(apple_dups))
print('\n')
print('# of Google duplicates:', len(google_dups))
#prints the number of duplicates in each dataset

# of Apple duplicates: 2


# of Google duplicates: 1181


In order to deal with duplicates, I will keep only the row with the highest number of reviews or ratings, since this is likely the most popular version of the app.

In [14]:
def find_most_reviewed(dataset, name_index, no_of_reviews_index, header=True):
    most_reviewed = {}
    if header:
        dataset = dataset[1:]
    for row in dataset:
        no_of_reviews_cell = float(row[no_of_reviews_index])
        name_cell = row[name_index]
        if name_cell in most_reviewed:
            if no_of_reviews_cell > most_reviewed[name_cell]:
                most_reviewed[name_cell] = no_of_reviews_cell
        else:
            most_reviewed[name_cell] = no_of_reviews_cell
    return most_reviewed
#creates a function that returns a dictionary with the highest number of reviews for each app in the dataset

In [15]:
apple_most_reviewed = find_most_reviewed(apple_apps, 1, 5)
google_most_reviewed = find_most_reviewed(google_apps, 0, 3)
#assigns the dictionaries generated by the find_most_reviewed function to their respective variables

In [16]:
def filter_dups(dataset, name_index, no_of_reviews_index, dictionary, header=True):
    clean_dataset = []
    already_added = set()
    if header:
        dataset = dataset[1:]
    for row in dataset:
        no_of_reviews = float(row[no_of_reviews_index])
        name = row[name_index]
        if name not in already_added and dictionary[name] == no_of_reviews:
            clean_dataset.append(row)
            already_added.add(name)
    return clean_dataset
#creates a function that returns a dataset with no duplicates; for each app name, only the first row with the highest number of reviews is kept

In [17]:
clean_google = filter_dups(google_apps, 0, 3, google_most_reviewed)
clean_apple = filter_dups(apple_apps, 1, 5, apple_most_reviewed)
#assigns the lists of lists generated by the filter_dups function to their respective variables
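
The two-step approach above (build a dictionary of maximums, then filter) can also be done in a single pass. The following is a minimal sketch under my own assumptions: `keep_most_reviewed` is a hypothetical name, the rows are assumed to exclude the header, and the sample data is made up. Unlike `filter_dups`, it orders the result by each name's first appearance.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Return one row per app name: the row with the most reviews (first seen wins ties)."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows of the form [name, number_of_reviews].
sample = [
    ['Box', '159'],
    ['Slack', '51507'],
    ['Box', '160'],
    ['Slack', '51507'],
]
deduped = keep_most_reviewed(sample, 0, 1)
print(deduped)  # [['Box', '160'], ['Slack', '51507']]
```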

There are also several non-English apps that need to be filtered out. Since some English app names contain emoji and other characters outside the standard ASCII range, we'll allow up to three such characters in an app's name before excluding it.

In [18]:
def filter_nonenglish_names(dataset, name_index, header=True):
    english_data=[]
    if header:
        dataset=dataset[1:]
    for row in dataset:
        name = row[name_index]
        non_english_chars = 0
        for char in name:
            if ord(char) > 127:
                non_english_chars += 1
        if non_english_chars <= 3:
            english_data.append(row)
    return english_data
#creates a function that iterates through the names column of a dataset and returns only the rows whose app names contain at most three non-ASCII characters

In [19]:
clean_eng_google = filter_nonenglish_names(clean_google, 0, header=False)
clean_eng_apple = filter_nonenglish_names(clean_apple, 1, header=False)
#assigns the lists of lists generated by the filter_nonenglish_names function to their respective variables

Finally, we're only interested in apps that are free, so any paid apps will need to be filtered out.

In [20]:
def filter_paid(dataset, price_index, header=True):
    free = []
    if header:
        dataset = dataset[1:]
    for row in dataset:
        price = row[price_index]
        price = price.replace('$', '')
        price = float(price)
        if price == 0:
            free.append(row)
    return free
#creates a function that returns a dataset with only rows that have a price of 0

In [21]:
final_google = filter_paid(clean_eng_google, 7, header=False)
final_apple = filter_paid(clean_eng_apple, 4, header=False)
#assigns the lists of lists generated by the filter_paid function to their respective variables
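
The price columns are stored as strings, so the comparison with 0 only works after conversion. A minimal sketch of that conversion on its own, using the hypothetical helpers `parse_price` and `is_free` and made-up price strings:

```python
def parse_price(text):
    """Convert a price string such as '$4.99' or '0.0' to a float."""
    return float(text.replace('$', ''))

def is_free(text):
    """Return True when the parsed price is exactly zero."""
    return parse_price(text) == 0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```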

## Finding Relevant Insights

Our business strategy is to make a minimal Android app and put it in the Google Play Store. If there is a lot of interest from users, we will develop it further. Finally, if it's profitable, we will make an Apple version and put it in the App Store. <br> <br>Because of this strategy, we need to identify qualities of apps that do well in both stores.<br> <br>The first step will be to get an idea of what the most common app genres are for both markets.

In [25]:
def freq_table(dataset, index, header=True):
    freqs = {}
    if header:
        dataset = dataset[1:]
    for row in dataset:
        column = row[index]
        if column in freqs:
            freqs[column] += 1
        else:
            freqs[column] = 1
    for key in freqs:
        freqs[key] = ((freqs[key])/len(dataset))*100
    return freqs
#a function that returns the percentage makeup of the entries of any column in a dataset as a dictionary

In [32]:
google_genre_freqs = freq_table(final_google, 9, header=False)
google_category_freqs = freq_table(final_google, 1, header=False)
apple_genre_freqs = freq_table(final_apple, 11, header=False)
#assigns the dictionaries generated by running the freq_table function on the genre categories of our data to their respective variables

In [36]:
def display_table(dataset, index, header=True):
    table = freq_table(dataset, index, header=header)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
#creates a function that displays the frequency tables generated by the freq_table function, in order of descending frequency; with the default header=True the first row is skipped, so pass header=False for a dataset whose header row has already been removed
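
For comparison, the standard library's `collections.Counter` can produce the same kind of sorted percentage table in a few lines. This is a sketch under my own assumptions (the helper name `percent_table` and the sample rows are made up); it is not the approach used in the rest of this notebook.

```python
from collections import Counter

def percent_table(rows, index):
    """Return (value, percentage) pairs for one column, most common first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up rows of the form [app_name, genre].
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
table = percent_table(sample, 1)
for genre, pct in table:
    print(genre, ':', pct)
# Games : 75.0
# Tools : 25.0
```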

In [103]:
display_table(final_google, 9)
print('\n')
display_table(final_google, 1)
print('\n')
display_table(final_apple, 11)
#displays the frequency tables for the genre and category columns to get an idea of how common the genres are; display_table prints its own output, so it is not wrapped in print()

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

We can see from our tables that while the majority of Apple apps fall into the Games or Entertainment genres, Android apps are a broader mix, led by Tools and Entertainment.<br> <br>How common a genre is isn't the best indicator of its popularity with users, however. To figure out how successful these genres are, we will need to look at downloads and ratings.

In [78]:
def count_ratings(dictionary, dataset, genre_index, ratings_index, header=True):
    if header:
        dataset=dataset[1:]
    averages = []
    for key in dictionary:
        total = 0
        len_genre = 0
        for row in dataset:
            genre = row[genre_index]
            if genre == key:
                ratings = float(row[ratings_index])
                total += ratings
                len_genre += 1
        average = total/len_genre
        average = average, key
        averages.append(average)
    return sorted(averages)
#creates a function that returns a list of sorted tuples of the average number of user ratings and the associated app genre
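
The function above rescans the whole dataset once for every genre. As a hedged alternative, the sketch below accumulates totals and counts for all genres in a single pass. The name `average_by_group` and the sample rows are assumptions for illustration.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return sorted (average, group) tuples computed in one pass over `rows`."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return sorted((totals[g] / counts[g], g) for g in totals)

# Made-up rows of the form [genre, number_of_ratings].
sample = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
averages = average_by_group(sample, 0, 1)
print(averages)  # [(50.0, 'Weather'), (200.0, 'Games')]
```

Because every group in the result comes from at least one row, this version also cannot divide by zero.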

In [100]:
apple_ratings = count_ratings(apple_genre_freqs, final_apple, 11, 5, header=False)
apple_ratings
#assigns the result of the count_ratings function when used on the apple apps data to the apple_ratings variable, then displays it

[(612.0, 'Medical'),
 (4004.0, 'Catalogs'),
 (7003.983050847458, 'Education'),
 (7491.117647058823, 'Business'),
 (14029.830708661417, 'Entertainment'),
 (16485.764705882353, 'Lifestyle'),
 (18684.456790123455, 'Utilities'),
 (21028.410714285714, 'Productivity'),
 (21248.023255813954, 'News'),
 (22812.92467948718, 'Games'),
 (23008.898550724636, 'Sports'),
 (23298.015384615384, 'Health & Fitness'),
 (26919.690476190477, 'Shopping'),
 (28243.8, 'Travel'),
 (28441.54375, 'Photo & Video'),
 (31467.944444444445, 'Finance'),
 (33333.92307692308, 'Food & Drink'),
 (39758.5, 'Book'),
 (52279.892857142855, 'Weather'),
 (57326.530303030304, 'Music'),
 (71548.34905660378, 'Social Networking'),
 (74942.11111111111, 'Reference'),
 (86090.33333333333, 'Navigation')]

The average number of ratings for a Navigation app is over 86,000, while Medical apps average only 612 user ratings per app. <br> <br>If we compare this with our list of the most populated categories, we find that Social Networking is a popular category. Despite the large number of apps already populating this field, the average Social Networking app receives over 71,000 user ratings.<br> <br>There is, however, a lot of skew in this category, with apps like Facebook getting the lion's share of the ratings.

In [91]:
for row in final_apple:
    if row[1] == 'Facebook':
        Facebook_ratings = row[5]
        print(Facebook_ratings)
#Prints the number of ratings for the app Facebook

2974676


Instead of examining averages, then, it may be more useful to look at medians.
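
A small worked example shows why. The numbers below are made up to mimic a genre in which one app has far more ratings than the rest.

```python
import statistics

# Made-up rating counts: four modest apps and one giant.
rating_counts = [500.0, 800.0, 1200.0, 2000.0, 3_000_000.0]

mean_count = statistics.mean(rating_counts)
median_count = statistics.median(rating_counts)

print(mean_count)    # 600900.0
print(median_count)  # 1200.0
```

The single large value pulls the mean far above what a typical app in the list receives, while the median stays at the middle value.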

In [93]:
import statistics
#imports the statistics module

In [99]:
def find_genre_medians(dataset, dictionary, genre_index, ratings_index, header=True):
    if header:
        dataset = dataset[1:]
    medians = []
    for genre in dictionary:
        genre_list = []
        for row in dataset:
            ds_genre = row[genre_index]
            ds_ratings = row[ratings_index]
            if ds_genre == genre:
                genre_list.append(float(ds_ratings))
        med = statistics.median(genre_list), genre
        medians.append(med)
    return sorted(medians)
#creates a function that returns a sorted list of (median number of ratings, genre) tuples
find_genre_medians(final_apple, apple_genre_freqs, 11, 5, header=False)
#displays the median number of ratings for each genre in the Apple data

[(289.0, 'Weather'),
 (373.0, 'News'),
 (421.5, 'Book'),
 (566.5, 'Medical'),
 (606.5, 'Education'),
 (798.5, 'Travel'),
 (904.0, 'Games'),
 (1110.0, 'Utilities'),
 (1111.0, 'Lifestyle'),
 (1150.0, 'Business'),
 (1197.5, 'Entertainment'),
 (1229.0, 'Catalogs'),
 (1490.5, 'Food & Drink'),
 (1628.0, 'Sports'),
 (1931.0, 'Finance'),
 (2206.0, 'Photo & Video'),
 (2459.0, 'Health & Fitness'),
 (3850.0, 'Music'),
 (4199.0, 'Social Networking'),
 (5936.0, 'Shopping'),
 (6614.0, 'Reference'),
 (8196.5, 'Navigation'),
 (8737.5, 'Productivity')]

We see that Shopping ranks high both on our list of the most common types of apps and on our list of the highest median numbers of ratings. Social Networking also ranks high on both. A shopping app with a social networking component would therefore likely be popular in the Apple market.

Our Android data includes the number of installs, a more direct measure of market success than the number of ratings.

In [107]:
display_table(final_google, 1)
#displays a table of the percentage of apps in each category

FAMILY : 18.910075595170937
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

In [114]:
def find_installs(dataset, dictionary, category_index, installs_index, header=True):
    if header:
        dataset = dataset[1:]
    installations = []
    for category in dictionary:
        total = 0
        len_category = 0
        for row in dataset:
            app_category = row[category_index]
            installs = row[installs_index]
            if app_category == category:
                # strips the '+' and ',' characters so values like '10,000+' can be converted to numbers
                app_installs = installs.replace('+', '').replace(',', '')
                total += float(app_installs)
                len_category += 1
        avg_installs = total / len_category
        installations.append((avg_installs, category))
    return sorted(installations)
#creates a function that returns a sorted list of (average number of installs, category) tuples
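
The Installs column holds strings such as '10,000+', so each value has to be cleaned before it can be averaged. The sketch below isolates that step; `parse_installs` is a hypothetical helper name, and the inputs are values that appear in the sample rows printed earlier.

```python
def parse_installs(text):
    """Convert an installs string such as '10,000+' to an integer."""
    return int(text.replace('+', '').replace(',', ''))

print(parse_installs('10,000+'))     # 10000
print(parse_installs('500,000+'))    # 500000
print(parse_installs('5,000,000+'))  # 5000000
```

Note that the trailing '+' means each value is a lower bound rather than an exact count, so averages built from these numbers are approximate.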

In [115]:
find_installs(final_google, google_category_freqs, 1, 5, header=False)

[(120550.61980830671, 'MEDICAL'),
 (253542.22222222222, 'EVENTS'),
 (513151.88679245283, 'BEAUTY'),
 (542603.6206896552, 'PARENTING'),
 (638503.734939759, 'LIBRARIES_AND_DEMO'),
 (647317.8170731707, 'AUTO_AND_VEHICLES'),
 (817657.2727272727, 'COMICS'),
 (854028.8303030303, 'DATING'),
 (1331540.5616438356, 'HOUSE_AND_HOME'),
 (1387692.475609756, 'FINANCE'),
 (1437816.2687861272, 'LIFESTYLE'),
 (1712290.1474201474, 'BUSINESS'),
 (1833495.145631068, 'EDUCATION'),
 (1924897.7363636363, 'FOOD_AND_DRINK'),
 (1986335.0877192982, 'ART_AND_DESIGN'),
 (3638640.1428571427, 'SPORTS'),
 (3695641.8198090694, 'FAMILY'),
 (4056941.7741935486, 'MAPS_AND_NAVIGATION'),
 (4188821.9853479853, 'HEALTH_AND_FITNESS'),
 (5074486.197183099, 'WEATHER'),
 (5201482.6122448975, 'PERSONALIZATION'),
 (7036877.311557789, 'SHOPPING'),
 (8767811.894736841, 'BOOKS_AND_REFERENCE'),
 (9549178.467741935, 'NEWS_AND_MAGAZINES'),
 (10801391.298666667, 'TOOLS'),
 (11640705.88235294, 'ENTERTAINMENT'),
 (13984077.710144928, 'TRAV