# App Profile Recommendation
This project analyzes free-to-download apps on both Google Play and the App Store to see which types of apps attract the most English-speaking users. More users means more revenue, because the main sources of revenue for free-to-download apps are usually in-app ads and in-app purchases.

### Opening and Exploring the Data
As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play. We will use two files that contain a sample of roughly 10,000 Android apps (googleplaystore.csv) and roughly 7,000 iOS apps (AppleStore.csv). We will use the explore_data function below to help us explore these datasets.

In [1]:
# All of our imports here
from csv import reader

In [2]:
# This function allows us to print rows of a dataset in a readable way. 
# This function assumes that dataset does not include the header row.
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

explore_data requires the dataset argument to be a list of lists.
Below, we open the two files and create a list of lists for each one to represent the data in that file.

In [3]:
with open('AppleStore.csv', encoding='utf8') as apple_file:
    apple_reader = reader(apple_file)
    apple_data = list(apple_reader)
    
with open('googleplaystore.csv', encoding='utf8') as google_file:
    google_reader = reader(google_file)
    google_data = list(google_reader)
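
Because explore_data is documented as expecting a dataset without its header row, it can help to separate the header from the data rows at load time. The following is a minimal sketch rather than the code this notebook uses; `load_csv` is a hypothetical helper name introduced here for illustration.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return (header, rows) as lists of strings.

    Hypothetical helper: not part of the original notebook.
    """
    # newline='' is the setting the csv module documentation recommends
    # when opening files that will be passed to csv.reader.
    with open(path, encoding=encoding, newline='') as f:
        data = list(reader(f))
    # An empty file has no header; return empty structures instead of failing.
    if not data:
        return [], []
    return data[0], data[1:]
```

With this helper, a call such as `apple_header, apple_rows = load_csv('AppleStore.csv')` would keep the header out of the rows. The rest of this notebook keeps the header inside apple_data and google_data, so the row indexes quoted later count the header as index 0.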

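Let's check that we imported the data correctly by looking at the first few rows and checking the size. Note that both lists still include the header row, so it is printed first and counted in the number of rows.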

In [4]:
explore_data(apple_data, 0, 10, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061

In [5]:
explore_data(google_data, 0, 10, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Eve

Let's look at the columns and see whether any in particular could help us identify which types of apps attract more users. We start with the App Store data (https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps). I have included links to the original datasets, which describe the columns in more detail.

In [6]:
print(apple_data[0])

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


It looks like there are two types of metrics that would be useful. One type describes user engagement. This includes columns like rating_count_tot and user_rating. The other type describes the content of the app. This includes columns like cont_rating and prime_genre. Other columns like price will also be important, because we want to filter out the apps that are not free: they are outside the scope of this analysis.

We do the same analysis on the Google Play data (https://www.kaggle.com/lava18/google-play-store-apps):

In [7]:
print(google_data[0])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Here, we can split the useful columns into the same two categories. For the user engagement columns, we have Rating, Reviews, and Installs. For the app content columns, we have Category, Content Rating, and Genres. Again, we will use the Type column to filter out the apps that are not free.

Also note that these are just guesses. The analysis we do in this project may show that other factors, such as how recently and how frequently an app is patched or improved, also have a significant effect on user engagement.
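
Since the columns of interest are identified by name but the rows are plain lists, a name-to-index lookup can make later code easier to read. This is a small sketch, not part of the original notebook; `column_index` is a hypothetical helper, and the header and example row are copied from the outputs shown above.

```python
def column_index(header):
    """Map each column name to its position in a row (hypothetical helper)."""
    return {name: i for i, name in enumerate(header)}

# Header copied from the google_data[0] output above.
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']
google_cols = column_index(google_header)

# Example row copied from the explore_data output above.
row = ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M',
       '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play',
       'January 15, 2018', '2.0.0', '4.0.3 and up']

# Pull out the user engagement columns by name instead of by position.
engagement = {name: row[google_cols[name]]
              for name in ('Rating', 'Reviews', 'Installs')}
print(engagement)
# {'Rating': '3.9', 'Reviews': '967', 'Installs': '500,000+'}
```

The values are still strings at this point; converting them to numbers is a separate cleaning step.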

### Deleting Wrong Data

Before we analyze the data, we have to clean it first (remove/correct wrong data, remove duplicate data, and/or modify data to fit the purpose of our analysis). Remember that we only want to look at apps that are free to download and directed toward an English-speaking audience. This means we need to:

1. Remove non-English apps.
2. Remove apps that aren't free.
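
The two filters listed above could be sketched as follows. This is only an illustration under stated assumptions, not the method this notebook settles on: `is_english` and `is_free` are hypothetical helper names, the threshold of three non-ASCII characters is an arbitrary choice, and the handling of a leading `$` in prices is an assumption about how paid apps are formatted.

```python
def is_english(name, max_non_ascii=3):
    """Heuristic: treat a name as English if it has few non-ASCII characters.

    Assumption: a small number of non-ASCII characters (dashes, emoji,
    trademark signs) is tolerated; the threshold is arbitrary.
    """
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

def is_free(price):
    """Return True when a price string such as '0' or '0.0' equals zero.

    Assumption: paid prices may carry a leading '$', which is stripped.
    """
    try:
        return float(price.strip().lstrip('$')) == 0.0
    except ValueError:
        return False  # an unparseable price is not counted as free

print(is_english('Instagram'))   # True
print(is_free('0.0'))            # True
print(is_free('$4.99'))          # False
```

A character-based check like this cannot recognize non-English names written in Latin script, so it should be read as a rough first pass.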

The Kaggle discussion board for the Google Play dataset also reports a wrong rating for entry 10472: https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015. According to the report, a value is missing from that row, so the remaining columns are shifted and the 'Rating' column holds the wrong value. We will check this row.

In [8]:
print(google_data[10472])
print(google_data[10473])

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


I printed the rows at index 10472 and 10473 because I wasn't sure whether the user who found this error was counting the header row. The row at index 10473 is missing its 'Category' value: it has 12 values instead of 13, so '1.9' sits in the 'Category' position and '19' in the 'Rating' position. We will delete this row (run the cell only once; running it again would delete the row that has moved into index 10473):

In [9]:
del google_data[10473]
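
One way to make the deletion safe to re-run is to delete the row only when it is actually malformed. The following is a sketch under the assumption that the dataset still has its header at index 0; `delete_if_malformed` is a hypothetical helper, not part of the original notebook.

```python
def delete_if_malformed(dataset, index):
    """Delete dataset[index] only if its length differs from the header's.

    Assumes dataset[0] is the header row. Returns True if a row was deleted.
    """
    if index < len(dataset) and len(dataset[index]) != len(dataset[0]):
        del dataset[index]
        return True
    return False

# Toy data: the row at index 2 is missing one value.
toy = [['App', 'Category', 'Rating'],
       ['A', 'TOOLS', '4.2'],
       ['B', '1.9'],
       ['C', 'GAME', '3.5']]

print(delete_if_malformed(toy, 2))  # True: the short row is removed
print(delete_if_malformed(toy, 2))  # False: the row now at index 2 is well formed
```

In this notebook the equivalent call would be `delete_if_malformed(google_data, 10473)`, which leaves the data unchanged if the cell is run a second time.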

Let's check that the row was deleted:

In [10]:
print(google_data[10472])
print(google_data[10473])

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
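
The row above was found through the Kaggle discussion, but the same kind of error can be searched for directly by comparing each row's length with the header's. This is a minimal sketch on toy data, assuming the header is at index 0; `find_malformed_rows` is a hypothetical helper introduced here.

```python
def find_malformed_rows(dataset):
    """Return the indexes of rows whose length differs from the header row's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Toy data: the rows at index 2 and 4 have the wrong number of values.
toy = [['App', 'Category', 'Rating'],
       ['A', 'TOOLS', '4.2'],
       ['B', '1.9'],
       ['C', 'GAME', '3.5'],
       ['D', 'GAME', '4.0', 'extra']]

print(find_malformed_rows(toy))  # [2, 4]
```

Calling `find_malformed_rows(google_data)` and `find_malformed_rows(apple_data)` would list any remaining rows with a missing or extra value. A length check only catches shifted columns; it does not detect values that are wrong but present.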
