# App Analysis

We only build apps that are free to download and install, and our main source of revenue is in-app ads. Revenue for any given app therefore depends mostly on its number of users: the more users who see and engage with the ads, the better.

Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users on Google Play and the App Store.

First, we read both datasets ([Play Store](https://www.kaggle.com/lava18/google-play-store-apps), [App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)) into lists of rows, stored in `play_store_data` and `apple_store_data`.

In [2]:
from csv import reader

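def open_file_generate_list(file):
    # Read a CSV file into a list of rows (each row is a list of strings).
    # The encoding is set explicitly because app names contain non-ASCII characters;
    # newline='' is what the csv module recommends for files passed to reader().
    with open(file, encoding='utf8', newline='') as opened_file:
        return list(reader(opened_file))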

play_store_data = open_file_generate_list('googleplaystore.csv')
apple_store_data = open_file_generate_list('AppleStore.csv')
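
To illustrate the shape of the lists this produces, here is a minimal sketch that parses an in-memory CSV string in the same way. The sample text and the names `sample_csv`, `header`, and `apps` are made up for this example and are not part of either dataset.

```python
from csv import reader
from io import StringIO

# Hypothetical two-line CSV standing in for a real file.
sample_csv = 'App,Category,Rating\n"Notes, Lists & More",PRODUCTIVITY,4.3\n'

rows = list(reader(StringIO(sample_csv)))

# The first row is the header; the remaining rows describe apps.
header, apps = rows[0], rows[1:]

print(header)
print(apps)
```

Every value comes back as a string, including numeric fields such as the rating, and a quoted field that contains a comma stays in one piece.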

We create a function `explore_data` that prints a slice of rows from a dataset and, optionally, its total number of rows and columns.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

The `play_store_data` dataset contains the following columns:

In [4]:
explore_data(play_store_data[:1], 0, 1)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']




Below are the first three rows of `play_store_data`, followed by the total number of rows (header excluded) and columns in the dataset.

In [5]:
explore_data(play_store_data[1:], 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


The `apple_store_data` dataset contains the following columns:

In [6]:
explore_data(apple_store_data[:1], 0, 1)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']




Below are the first three rows of `apple_store_data`, followed by the total number of rows (header excluded) and columns in the dataset.

In [7]:
explore_data(apple_store_data[1:], 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


We now verify that every row is complete by comparing its number of elements to the number of columns in the header. Rows that do not match are printed and deleted.

In [8]:
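def incomplete_rows(dataset):
    header_size = len(dataset[0])
    errors = 0
    # Walk backwards by index so that deleting a row does not shift the rows
    # still to be checked (deleting while iterating forward skips the next row).
    for index in range(len(dataset) - 1, -1, -1):
        row = dataset[index]
        if len(row) != header_size:
            print('[Deleted row at index %s] %s' % (index, row))
            errors += 1
            del dataset[index]
    if errors == 0:
        print('No errors')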

print('Play Store:')
incomplete_rows(play_store_data)
print('App Store:')
incomplete_rows(apple_store_data)

Play Store:
[Deleted row at index 10473] ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
App Store:
No errors
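
If the original list should stay untouched, the same check can be written without deleting anything. The sketch below is one possible variant, not the approach used in this notebook; `split_by_length` and the `sample` rows are hypothetical names and data introduced for illustration.

```python
def split_by_length(dataset):
    """Return (complete_rows, malformed_rows) without modifying dataset."""
    expected = len(dataset[0])  # the header defines the expected row length
    complete, malformed = [], []
    for index, row in enumerate(dataset):
        if len(row) == expected:
            complete.append(row)
        else:
            malformed.append((index, row))
    return complete, malformed

# Made-up rows: the third one is missing its category.
sample = [
    ['App', 'Category', 'Rating'],
    ['A', 'TOOLS', '4.0'],
    ['B', '4.5'],
    ['C', 'GAME', '3.9'],
]

complete, malformed = split_by_length(sample)
print(malformed)
```

Returning the malformed rows with their indices makes it possible to inspect them before deciding whether to drop or repair them.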


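Next, we check for duplicate apps, using the first column of each dataset as the key (the app name for the Play Store, the `id` for the App Store).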

In [9]:
def duplicate_and_unique_apps(dataset):
    duplicate_apps = []
    unique_apps = []
    for app in dataset:
        name = app[0]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    
    return duplicate_apps, unique_apps

play_store_duplicates = duplicate_and_unique_apps(play_store_data[1:])

print('Number of duplicates in Play Store')
print(len(play_store_duplicates[0]))

apple_store_duplicates = duplicate_and_unique_apps(apple_store_data[1:])

print('Number of duplicates in Apple Store')
print(len(apple_store_duplicates[0]))

Number of duplicates in Play Store
1181
Number of duplicates in Apple Store
0
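
An equivalent count can be obtained with `collections.Counter`, which avoids searching a list once per app. This is a minimal sketch on made-up rows; `count_duplicates`, `key_index`, and `sample_rows` are hypothetical names used only here.

```python
from collections import Counter

def count_duplicates(rows, key_index=0):
    """Return (number of duplicate rows, number of unique keys)."""
    counts = Counter(row[key_index] for row in rows)
    # Each key seen n times contributes n - 1 duplicates.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates, len(counts)

# Made-up rows: 'A' appears three times, 'B' once.
sample_rows = [['A', '10'], ['B', '5'], ['A', '12'], ['A', '12']]

n_duplicates, n_unique = count_duplicates(sample_rows)
print(n_duplicates, n_unique)
```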


Next we analyze an example of app duplicates to build a criterion for removing the duplicates.

In [10]:
for app in play_store_data:
    name = app[0]
    if name == play_store_duplicates[0][0]:
        print(app)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


We can assume that the higher the number of reviews, the more recent the data. We will therefore keep the row with the highest number of reviews for each app and remove the other entries.

To achieve this, we first create a dictionary `reviews_max` to store each unique app name and its highest `reviews` value.

In [15]:
reviews_max = {}

for app in play_store_data[1:]:
    name = app[0]
    reviews = int(app[3])
    # Store the first count seen for a name, then overwrite it only with larger counts.
    if name not in reviews_max or reviews > reviews_max[name]:
        reviews_max[name] = reviews

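We then create a new list `android_clean` that stores one row per app: for duplicated apps, we keep only the row with the highest `reviews` value. The `already_added` list ensures that an app is added only once when several of its rows share that highest value.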

In [42]:
android_clean = []
already_added = []

for app in play_store_data[1:]:
    name = app[0]
    n_reviews = int(app[3])  # same type as the values stored in reviews_max
    if name not in already_added and n_reviews == reviews_max[name]:
        android_clean.append(app)
        already_added.append(name)

explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
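
The two steps above can also be combined into a single pass that stores the best row per name directly. This is a sketch of an alternative, assuming Python 3.7+ (where dictionaries keep insertion order); `keep_most_reviewed` and the sample rows are hypothetical. Note that it orders apps by where each name first appears, which can differ from the order of `android_clean`.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = int(row[reviews_index])
        # Strict '>' means the first row wins when review counts are tied.
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Made-up rows: 'A' appears three times with different review counts.
rows = [
    ['A', 'X', '4.2', '80805'],
    ['B', 'X', '4.0', '10'],
    ['A', 'X', '4.2', '80804'],
    ['A', 'X', '4.2', '80806'],
]

deduplicated = keep_most_reviewed(rows)
print(deduplicated)
```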


Because our company only builds apps for an English-speaking audience, we remove apps whose names suggest they target another language.

In [55]:
def app_is_english(app_name):
    non_ASCII_count = 0
    for char in app_name:
        if ord(char) > 127:
            non_ASCII_count += 1
            if non_ASCII_count > 3: return False
    return True

def remove_non_english(data_set, new_data_set, index_for_name):
    for app in data_set:
        name = app[index_for_name]
        if app_is_english(name): new_data_set.append(app)

play_store_clean = []
app_store_clean = []

remove_non_english(android_clean, play_store_clean, 0)
remove_non_english(apple_store_data[1:], app_store_clean, 1) # App Store dataset still has headers

explore_data(play_store_clean, 0, 3, True)
print('\n')
explore_data(app_store_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

As stated initially, our company only builds free apps, so we now remove all non-free apps from both datasets.
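
The filter that follows compares price strings exactly (`'0'` for the Play Store, `'0.0'` for the App Store). As a more tolerant alternative, the price could be parsed as a number first. The sketch below assumes that paid prices may carry a leading `$`; the helper name `is_free` is hypothetical.

```python
def is_free(price_text):
    """Return True if the price string represents zero."""
    # Assumption: prices may be written with a leading dollar sign.
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Anything that is not a number is treated as not free.
        return False

for price in ['0', '0.0', '$0.00', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```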

In [58]:
android_final = []
ios_final = []

for app in play_store_clean:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in app_store_clean:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
explore_data(android_final, 0, 3, True)
print('\n')
explore_data(ios_final, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 8864
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 