# Mobile App Analysis
We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better.  

Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

Luckily, there are two data sets that seem suitable for our goals:

* A data set containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. You can download the data set directly from this [link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

## Opening And Exploring Data

Let's start by opening the two data sets and turning them into lists of rows.

In [1]:
from csv import reader
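# open datasets (explicit encoding, since some app names contain non-ASCII characters)
# and convert each one into a list of rows
with open("AppleStore.csv", encoding="utf8") as ios:
    ios_reader = list(reader(ios))
with open("googleplaystore.csv", encoding="utf8") as android:
    android_reader = list(reader(android))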

# storing headers and data
ios_header = ios_reader[0]
ios_data = ios_reader[1:]

android_header = android_reader[0]
android_data = android_reader[1:]

Let's explore the 2 datasets with the function below:

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

* **Exploring iOS Data**

In [3]:
ios_header


['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [4]:
explore_data(ios_data, 0, 4, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7197
Number of columns: 16


In total, there are **7197** apps, and the first 4 rows should give us a general idea of the data structure.

* **Exploring Android Data**

In [5]:
android_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [6]:
explore_data(android_data, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


In total, there are **10841** apps, and the first 4 rows should give us a general idea of the data structure.

The Android header is fairly descriptive, so the column names alone give us a sense of what the data is about.

## Removing Wrong Data

According to [this discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) on [**Kaggle**](https://www.kaggle.com), the row at index **10472** is missing a column.

Let's revisit the header and what a normal row looks like:

In [7]:
# dataset header
print(android_header, end="\n")

# normal dataset
print(android_data[0], end="\n")
print(len(android_data[0]), end="\n")

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']
13


Okay! Now let's check out the so-called incorrect row at index _**10472**_:

In [8]:
# wrong dataset
print(android_data[10472])
print(len(android_data[10472]))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
12


We can clearly see that this row is missing its **Category** value. Therefore, we need to remove it from our data set by using `del`.

In [9]:
# delete the malformed row at index 10472 (run this cell only once)
del android_data[10472]

# check the current 10472nd entry
print(android_data[10472])

# check out the length
print(len(android_data))


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
10840


In [10]:
print(len(android_data))

10840


We have now removed the malformed row: the list of Android rows has 10840 entries, one fewer than the original 10841. The entry at index 10472 is now a different app, because every later row has shifted up by one position.
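
The index of the malformed row came from the Kaggle discussion. As a minimal sketch, such rows could also be found directly by comparing each row's length with the header's. The `header` and `rows` values and the `find_malformed_rows` helper below are made up for illustration and are not part of the original notebook.

```python
# Toy header and rows standing in for android_header / android_data (assumed data)
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # 'Category' is missing, so the row is too short
    ['App C', 'GAME', '4.5'],
]

def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    return [i for i, row in enumerate(dataset) if len(row) != len(header)]

bad_indexes = find_malformed_rows(rows, header)
print(bad_indexes)
```

When deleting several rows found this way, it is safer to delete from the highest index down, so earlier deletions do not shift the remaining indexes.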

## Removing Duplicate Entries

If you explore the Google Play data set long enough or look at the discussions section, you'll notice some apps have duplicate entries. For instance, Instagram has four entries:

In [11]:
for app in android_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In total, there are **1,181** cases where an app occurs more than once:

In [12]:
duplicate_apps = []
unique_apps = []

for app in android_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
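
The same count can probably be obtained with `collections.Counter`, which avoids the repeated list lookups in the loop above. This is only a sketch on made-up rows; `rows`, `name_counts`, `n_duplicates` and `repeated_names` are illustrative names, not part of the original notebook.

```python
from collections import Counter

# Toy rows standing in for android_data; only the name column (index 0) matters here
rows = [['Instagram'], ['Box'], ['Instagram'], ['Slack'], ['Box'], ['Instagram']]

name_counts = Counter(row[0] for row in rows)

# Every occurrence beyond the first counts as a duplicate entry
n_duplicates = sum(count - 1 for count in name_counts.values())
repeated_names = sorted(name for name, count in name_counts.items() if count > 1)

print('Number of duplicate entries:', n_duplicates)
print('Apps with more than one entry:', repeated_names)
```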


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you examine the rows we printed for the Instagram app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show the data was collected at different times.

In [13]:
for app in android_data:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


We can use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.

Earlier, we looped through the Google Play data set and found that there are 1,181 duplicates. After we remove the duplicates, we should be left with **9,659** rows:

In [14]:
print('Expected length:', len(android_data) - 1181)

Expected length: 9659


To remove the duplicates, we first create a dictionary in which each key is a unique app name and the corresponding value is the highest number of reviews of that app.

In [15]:
reviews_max = dict()

# loop through the Google Play data set
for app in android_data:
    name = app[0]
    n_reviews = float(app[3])
    # keep the highest review count seen so far for each app
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

# print the number of unique apps
print(len(reviews_max))

9659
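
The dictionary alone does not remove anything yet. Below is a minimal sketch of how it could be used to keep one row per app, run on made-up rows. The names `rows`, `android_clean` and `already_added` are assumptions for illustration; the set is there because several rows of one app can share the same highest review count.

```python
# Toy rows standing in for android_data: name at index 0, review count at index 3
rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

# Step 1: highest review count per app name
reviews_max = {}
for app in rows:
    name, n_reviews = app[0], float(app[3])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

# Step 2: keep one row per app, the one matching the highest review count
android_clean = []     # hypothetical name for the de-duplicated list
already_added = set()  # guards against ties (identical rows with the same maximum)
for app in rows:
    name, n_reviews = app[0], float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.add(name)

print(len(android_clean))
for app in android_clean:
    print(app)
```

On the real data, the length of the resulting list should match the expected length computed earlier.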


## Removing Non-English Apps

Remember we use English for the apps we develop at our company, and we'd like to analyze only the apps that are directed toward an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience.

In [16]:
print(ios_data[813][1])
print(ios_data[6731][1])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


We're not interested in keeping these apps, so we'll remove them. One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Behind the scenes, each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for character 'a' is 97, character 'A' is 65, and character '爱' is 29,233. We can get the corresponding number of each character using the ord() built-in function.

In [17]:
print(ord('a'))
print(ord('A'))
print(ord('爱'))
print(ord('5'))
print(ord('+'))

97
65
29233
53
43


The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.

In [18]:
# returns True only if every character is ASCII, i.e. the string contains NO non-English character
def contain_non_eng_char(s):
    for char in s:
        if ord(char) > 127:
            return False
    return True

# test function
print(contain_non_eng_char('Instagram'))
print(contain_non_eng_char('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(contain_non_eng_char('Docs To Go™ Free Office Suite'))
print(contain_non_eng_char('Instachat 😜'))


True
False
False
False


We wrote a function that detects non-English app names, but we saw that the function couldn't correctly identify certain English app names like 'Docs To Go™ Free Office Suite' and 'Instachat 😜'. This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127.

In [19]:
print(ord('™'))
print(ord('😜'))

8482
128540


If we're going to use the function we've created, we'll lose useful data since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

Let's edit the function we created previously, and then use it to filter out the non-English apps.

In [20]:
# returns True if the string looks English, allowing a few non-ASCII characters
def contain_non_eng_char(s):
    # keep track of non-English character
    count = 0
    for char in s:
        if count == 4:
            return False
        if ord(char) > 127:
            count += 1
    return True

# test function
print(contain_non_eng_char('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(contain_non_eng_char('Docs To Go™ Free Office Suite'))
print(contain_non_eng_char('Instachat 😜'))

False
True
True
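
One edge case is worth noting: the function above tests `count == 4` before it counts the current character, so a name whose fourth non-ASCII character is also its last character is still reported as English. Below is a sketch of a variant that checks right after counting. The name `is_english` and the `max_non_ascii` parameter are hypothetical, and switching to it may change the list sizes reported later.

```python
def is_english(name, max_non_ascii=3):
    """Hypothetical variant: True if name has at most max_non_ascii non-ASCII characters."""
    non_ascii = 0
    for char in name:
        if ord(char) > 127:
            non_ascii += 1
            # Check right after counting, so a trailing character is not missed
            if non_ascii > max_non_ascii:
                return False
    return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('abc™™™™'))  # four non-ASCII characters, the last one at the very end
```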


Now let's use the function to filter out the non-English apps from both data sets.

In [21]:
# create Android and iOS lists that contain only English apps
clean_android_apps = []
clean_ios_apps = []

# append English Android apps (the whole row, not just the name) to the new list
for app in android_data:
    name = app[0]
    if contain_non_eng_char(name):
        clean_android_apps.append(app)

# append English iOS apps (the whole row, not just the name) to the new list
for app in ios_data:
    name = app[1]
    if contain_non_eng_char(name):
        clean_ios_apps.append(app)

Now, we can check the length of the lists after keeping only the English apps.

In [22]:
# Length of original lists
print('Size of original android list: ' + str(len(android_data)))
print('Size of original ios list: ' + str(len(ios_data)))
        
# Lengths of clean lists    
print('Size of cleaned android list: ' + str(len(clean_android_apps)))
print('Size of cleaned ios list: ' + str(len(clean_ios_apps)))

Size of original android list: 10840
Size of original ios list: 7197
Size of cleaned android list: 10797
Size of cleaned ios list: 6226


After filtering, the Android list contains **10797** apps and the iOS list contains **6226**.

## Isolating the Free Apps

First, let's create 2 lists that contain free apps for both Android and iOS

In [23]:
# Create free android apps and free ios apps lists
free_android_apps = []
free_ios_apps = []

# Append free android app into the new list
for app in android_data:
    is_free = app[6] == 'Free'
    if is_free:
        free_android_apps.append(app)

# Append free ios app into the new list
for app in ios_data:
    is_free = app[4] == '0.0'
    if is_free:
        free_ios_apps.append(app)        

Now, we can compare the number of free apps to the number of total apps.

In [24]:
# Length of original lists
print('Size of original android list: ' + str(len(android_data)))
print('Size of original ios list: ' + str(len(ios_data)))
        
# Lengths of free-app lists
print('Size of free android list: ' + str(len(free_android_apps)))
print('Size of free ios list: ' + str(len(free_ios_apps)))

Size of original android list: 10840
Size of original ios list: 7197
Size of free android list: 10039
Size of free ios list: 4056


After keeping only the free apps, the Android list contains **10039** apps and the iOS list contains **4056**.

## Most Common Apps by Genre

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.  
To minimize risks and overhead, our validation strategy for an app idea consists of three steps:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store. 

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.  
Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets.
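
One way such a frequency table could be built is sketched below, with the share of each value expressed as a percentage. The helpers `freq_table` and `display_table` and the toy `rows` are assumptions for illustration, not part of the original notebook.

```python
def freq_table(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = {}
    for row in dataset:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

def display_table(dataset, index):
    """Print the table sorted from most to least common value."""
    table = freq_table(dataset, index)
    for value, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
        print(value, ':', round(pct, 2))

# Toy rows standing in for the app data: name at index 0, category at index 1
rows = [['a', 'GAME'], ['b', 'TOOLS'], ['c', 'GAME'], ['d', 'GAME']]
display_table(rows, 1)
```

Sorting on `item[1]` works here because `table.items()` yields `(value, percentage)` pairs, so the second element is the number to sort by.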

In [34]:
android_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [39]:
android_data[100]

['Natural recipes for your beauty',
 'BEAUTY',
 '4.7',
 '1150',
 '9.8M',
 '100,000+',
 'Free',
 '0',
 'Everyone',
 'Beauty',
 'May 15, 2018',
 '4.0',
 '4.1 and up']

In [50]:
# Build a table of total installs per category (column index 1)
android_genre_freqs = {}
for app in android_data:
    genre = app[1]
    # strip the thousands separators and the trailing '+' before converting
    installs = float(app[5].replace(",", "").replace("+", ""))
    if genre in android_genre_freqs:
        android_genre_freqs[genre] += installs
    else:
        android_genre_freqs[genre] = installs

In [51]:
android_genre_freqs

{'ART_AND_DESIGN': 124338100.0,
 'AUTO_AND_VEHICLES': 53130211.0,
 'BEAUTY': 27197050.0,
 'BOOKS_AND_REFERENCE': 1921469576.0,
 'BUSINESS': 1001914865.0,
 'COMICS': 56086150.0,
 'COMMUNICATION': 32647276251.0,
 'DATING': 264310807.0,
 'EDUCATION': 871452000.0,
 'ENTERTAINMENT': 2869160000.0,
 'EVENTS': 15973161.0,
 'FAMILY': 10258263505.0,
 'FINANCE': 876648734.0,
 'FOOD_AND_DRINK': 273898751.0,
 'GAME': 35086024415.0,
 'HEALTH_AND_FITNESS': 1583072512.0,
 'HOUSE_AND_HOME': 168712461.0,
 'LIBRARIES_AND_DEMO': 62995910.0,
 'LIFESTYLE': 537643539.0,
 'MAPS_AND_NAVIGATION': 724281890.0,
 'MEDICAL': 53257437.0,
 'NEWS_AND_MAGAZINES': 7496317760.0,
 'PARENTING': 31521110.0,
 'PERSONALIZATION': 2325494782.0,
 'PHOTOGRAPHY': 10088247655.0,
 'PRODUCTIVITY': 14176091369.0,
 'SHOPPING': 3247848785.0,
 'SOCIAL': 14069867902.0,
 'SPORTS': 1751174498.0,
 'TOOLS': 11452771915.0,
 'TRAVEL_AND_LOCAL': 6868887146.0,
 'VIDEO_PLAYERS': 6222002720.0,
 'WEATHER': 426100520.0}
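
Total installs favour categories that simply contain many apps or a few giants, so an average per app may be a useful complement. The sketch below runs on made-up rows; `parse_installs`, `totals`, `counts` and `averages` are hypothetical names, and the install figures are open-ended buckets such as `'10,000+'`, so the results are rough lower bounds at best.

```python
def parse_installs(text):
    """Turn a string such as '10,000+' into the float 10000.0."""
    return float(text.replace(',', '').replace('+', ''))

# Toy rows: category at index 1, installs at index 5 (same layout as the Android data)
rows = [
    ['a', 'GAME', '', '', '', '1,000,000+'],
    ['b', 'GAME', '', '', '', '10,000+'],
    ['c', 'TOOLS', '', '', '', '500+'],
]

totals, counts = {}, {}
for app in rows:
    category = app[1]
    totals[category] = totals.get(category, 0.0) + parse_installs(app[5])
    counts[category] = counts.get(category, 0) + 1

# Average installs per app in each category
averages = {category: totals[category] / counts[category] for category in totals}
print(averages)
```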

In [52]:
# sort the category names by total installs, highest first
sorted_android_genre_freqs = sorted(android_genre_freqs, key=android_genre_freqs.get, reverse=True)

In [53]:
sorted_android_genre_freqs

['GAME',
 'COMMUNICATION',
 'PRODUCTIVITY',
 'SOCIAL',
 'TOOLS',
 'FAMILY',
 'PHOTOGRAPHY',
 'NEWS_AND_MAGAZINES',
 'TRAVEL_AND_LOCAL',
 'VIDEO_PLAYERS',
 'SHOPPING',
 'ENTERTAINMENT',
 'PERSONALIZATION',
 'BOOKS_AND_REFERENCE',
 'SPORTS',
 'HEALTH_AND_FITNESS',
 'BUSINESS',
 'FINANCE',
 'EDUCATION',
 'MAPS_AND_NAVIGATION',
 'LIFESTYLE',
 'WEATHER',
 'FOOD_AND_DRINK',
 'DATING',
 'HOUSE_AND_HOME',
 'ART_AND_DESIGN',
 'LIBRARIES_AND_DEMO',
 'COMICS',
 'MEDICAL',
 'AUTO_AND_VEHICLES',
 'PARENTING',
 'BEAUTY',
 'EVENTS']