
# iOS and Android App Research

* I am putting together datasets to better understand the statistics behind app development. We'll be looking at data collected from Google Play and the App Store.

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps.  
We make our apps available on Google Play and in the App Store.  


We only build apps that are free to download and install, and our main source of revenue consists of in-app ads.  
This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better.  
Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

In [1]:
from csv import reader

# App Store data set
opened_file_ios = open('AppleStore.csv', encoding='utf8')  # explicit encoding: app names include non-ASCII text
read_ios = reader(opened_file_ios)
ios_all_data = list(read_ios)
ios_header = ios_all_data[0]
ios = ios_all_data[1:]

# Google Play data set
opened_file_android = open('googleplaystore.csv', encoding='utf8')  # explicit encoding: app names include non-ASCII text
read_droid = reader(opened_file_android)
droid_all_data = list(read_droid)
droid_header = droid_all_data[0]
droid = droid_all_data[1:]
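
The loading steps above leave both files open. As a minimal sketch, the same steps could be wrapped in a helper that closes the file once it has been read. The name `load_csv` and the tiny demo file are assumptions made for illustration; they are not part of the original notebook or data sets.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file, closing the file afterwards."""
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Tiny demonstration file with made-up rows (not the real data set).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        csv.writer(f).writerows([["App", "Price"], ["Demo App", "0"]])
    demo_header, demo_rows = load_csv(demo_path)

print(demo_header)
print(demo_rows)
```

With the real files, a call such as `ios_header, ios = load_csv('AppleStore.csv')` would stand in for the five lines used per data set above.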

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    
    if rows_and_columns:
        print(f'Number of rows: {len(dataset)}')
        print(f'Number of columns:{len(dataset[0])}\n')
        
    for row in dataset_slice:
        print(row)
        print('\n')  # print() appends its own newline, so this leaves two blank lines after each row.


#### To find free, user-driven apps funded by ad revenue, I believe the relevant columns will be:

* name
* price
* user ratings 
* prime genre
* category
* reviews
* genre

In [3]:
print('iOS Header...')
print(ios_header, '\n')
print('iOS Data...')
explore_data(ios, 1, 3, True)

print('Android Header...')
print(droid_header, '\n')
print('Android Data...')
explore_data(droid, 1, 3, True)

iOS Header...
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

iOS Data...
Number of rows: 7197
Number of columns:16

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Android Header...
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

Android Data...
Number of rows: 10841
Number of columns:13

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Laun


### Make sure every `droid` row has the same length as the header row (`droid_header`).


In [4]:
for row in droid:
    if len(row) != len(droid_header):
        print(row)
        print('\n')
        print(f"Index position is {droid.index(row)}")

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Index position is 10472
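
One caveat with the check above: `list.index(row)` returns the position of the first row equal to `row`, and it rescans the list on every call. A possible alternative, sketched here with a hypothetical helper `find_bad_rows` and made-up sample rows, is to track the position with `enumerate`.

```python
def find_bad_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != len(header)]


# Made-up sample: the second row is missing a value.
sample_header = ["App", "Category", "Rating"]
sample_rows = [
    ["App A", "TOOLS", "4.1"],
    ["App B", "3.9"],
    ["App C", "GAME", "4.5"],
]

for index, row in find_bad_rows(sample_header, sample_rows):
    print(f"Index position is {index}: {row}")
```

Calling `find_bad_rows(droid_header, droid)` on the Google Play data should report the same malformed row as the loop above.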



#### Print the `row` in an easy-to-read fashion.


In [5]:
problem_row = droid[10472]
for item_head, item_row in zip(droid_header, problem_row):
    print(f"{item_head}: {item_row}")

App: Life Made WI-Fi Touchscreen Photo Frame
Category: 1.9
Rating: 19
Reviews: 3.0M
Size: 1,000+
Installs: Free
Type: 0
Price: Everyone
Content Rating: 
Genres: February 11, 2018
Last Updated: 1.0.19
Current Ver: 4.0 and up


**Looks like this app has a rating of 19, which is not possible because Google Play ratings top out at 5.  
After reading the discussion board, it seems the row is missing its `Category` entry, so every later value is shifted one column to the left.  
We'll just delete the whole entry (`row`) for now.**

In [6]:
del droid[10472]  # run this cell only once: re-running it would delete the next (valid) row

In [7]:
# Confirm the deletion: index 10472 should now hold a well-formed row.
problem_row = droid[10472]
for item_head, item_row in zip(droid_header, problem_row):
    print(f"{item_head}: {item_row}")

App: osmino Wi-Fi: free WiFi
Category: TOOLS
Rating: 4.2
Reviews: 134203
Size: 4.1M
Installs: 10,000,000+
Type: Free
Price: 0
Content Rating: Everyone
Genres: Tools
Last Updated: August 7, 2018
Current Ver: 6.06.14
Android Ver: 4.4 and up



### Make sure every `ios` row has the same length as the header row (`ios_header`).

In [8]:
for row in ios:
    if len(row) != len(ios_header):
        print(row)
        print('\n')
        print(f"Index position is {ios.index(row)}")

**Looks like we're good.**


#### Reminder:

**Recall that at our company, we only build apps that are free to download and install, and we design them for an English-speaking audience.  
This means that we'll need to do the following:**

* Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
* Remove apps that aren't free.

## Investigate Duplicate Apps.

### Google Play Duplicate Data Check.

In [9]:
duplicate_apps = []
unique_apps = []

for app in droid:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print(f'Number of duplicate apps: {len(duplicate_apps)}\n')
print(f'Number of unique apps: {len(unique_apps)}\n')
print(f'Examples of duplicate apps: {duplicate_apps[:15]}')

Number of duplicate apps: 1181

Number of unique apps: 9659

Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']



##### That's 1181 duplicate apps. Let's see if we can find some discrepancies between the entries.


In [10]:
for app in droid:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']



#### Use Review Totals to ID Duplicates.
`Instagram` has multiple entries whose `Reviews` totals differ (the `Rating` is 4.5 in every row).  
It's reasonable to assume that the higher the `Reviews` total, the more recent the data.  
Instead of removing duplicates randomly, we'll keep the entry with the highest `Reviews` total and drop the rest.

#### Create an empty dictionary, then store each app name as a key with its highest number of reviews as the value.

In [11]:
reviews_max = {}

for app in droid:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(f'Non-duplicate length {len(reviews_max)}')
print(f'Expected length: {len(droid) - 1181}')

Non-duplicate length 9659
Expected length: 9659


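* Create two lists: one for the cleaned data and one for the names of apps already added to the cleaned list.  
* The `already_added` check matters when duplicate entries share the same highest review count; without it, each of those rows would be kept.
* The lengths of both lists should be equal.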

In [12]:
droid_clean = []
already_added = []

for app in droid:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        droid_clean.append(app)
        already_added.append(name)

print(f'Length of clean Google Play: {len(droid_clean)}')
print(f'Length of already added: {len(already_added)}')


Length of clean Google Play: 9659
Length of already added: 9659
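
The two passes above (build `reviews_max`, then filter) could also be condensed into a single pass that stores the best row per app name. This is only a sketch: `dedupe_by_max_reviews` is a hypothetical helper, and the sample rows are shortened to four columns, with "Demo App" being made up.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_idx]
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Shortened sample rows: name, category, rating, reviews.
sample = [
    ["Instagram", "SOCIAL", "4.5", "66577313"],
    ["Instagram", "SOCIAL", "4.5", "66577446"],
    ["Demo App", "TOOLS", "4.0", "10"],
    ["Instagram", "SOCIAL", "4.5", "66577313"],
]

deduped = dedupe_by_max_reviews(sample)
for row in deduped:
    print(row)
```

The strict `>` comparison keeps the first of several rows tied at the maximum, which plays the role of the `already_added` list. Note that the result is ordered by each name's first appearance, so the row order can differ from `droid_clean` even though the same rows are kept.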



### iOS Duplicate App Check


In [13]:
duplicate_apps = []
unique_apps = []

for app in ios:
    app_id = app[0]
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else:
        unique_apps.append(app_id)

print(f'Number of duplicate apps: {len(duplicate_apps)}\n')
print(f'Number of unique apps: {len(unique_apps)}\n')
print(f'Examples of duplicate apps: {duplicate_apps[:15]}')

Number of duplicate apps: 0

Number of unique apps: 7197

Examples of duplicate apps: []


#### Well, that's good news: the App Store doesn't have any duplicates. That unique ID column really helps keep the data clean :)


## Check For, and Remove, non-English apps.

Below are a few examples.


In [14]:
print(ios[813][1])
print(ios[6731][1])
print('\n')
print(droid_clean[4412][0])
print(droid_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


中国語 AQリスニング
لعبة تقدر تربح DZ


We can iterate through the string and flag characters that fall outside the ASCII range (code points 0 to 127), which covers the characters commonly used in English text.  
Let's define a function that checks a string for non-English characters.

In [15]:
def english_checker(a_string):
    for char in a_string:
        if ord(char) > 127:
            return False
        
    return True
        
print(english_checker(ios[813][1]))
print(english_checker(ios[6731][1]))
print('\n')
print(english_checker(droid_clean[4412][0]))
print(english_checker(droid_clean[7940][0]))
print('\n')
print(english_checker('Docs To Go™ Free Office Suite'))
print(english_checker('Instachat 😜'))

False
False


False
False


False
False


Wait a second, the trademark symbol and the emoji shouldn't disqualify these names. Those apps are clearly geared toward English speakers.  
We need to modify that function a bit so it tolerates a few non-ASCII characters, such as emoji; here we allow up to three.

In [16]:
def english_checker(a_string):
    non_ascii = 0
    for char in a_string:
        if ord(char) > 127:
            non_ascii += 1
            
    if non_ascii > 3:
        return False
    else:    
        return True
        
print(english_checker(ios[813][1]))
print(english_checker(ios[6731][1]))
print('\n')
print(english_checker(droid_clean[4412][0]))
print(english_checker(droid_clean[7940][0]))
print('\n')
print(english_checker('Docs To Go™ Free Office Suite'))
print(english_checker('Instachat 😜'))

False
False


False
False


True
True


This seems to work pretty well. It's not perfect, but it should clean out most of the non-English apps.  
Let's go ahead and store these cleaned-up English apps in a new list.  
Don't forget to use `droid_clean`, as we have already spent time cleaning up a bit of the Google Play data.

In [27]:
ios_english = []
ios_non_english = []

for app in ios:
    name = app[1]
    if english_checker(name):
        ios_english.append(app)
    else:
        ios_non_english.append(app)

print(f'Length of English: {len(ios_english)}')
print(f'Length of non-English: {len(ios_non_english)}')
print(f'Total length: {len(ios_english) + len(ios_non_english)}\n') 


droid_english = []
droid_non_english = []

for app in droid_clean:
    name = app[0]
    if english_checker(name):
        droid_english.append(app)
    else:
        droid_non_english.append(app)
         
            
print(f'Length of English: {len(droid_english)}')
print(f'Length of non-English: {len(droid_non_english)}')
print(f'Total length: {len(droid_english) + len(droid_non_english)}') 

# explore_data(droid_english, 0, 3, True)
# for name in droid_english[:25]:
#     print(name[0])

Length of English: 6183
Length of non-English: 1014
Total length: 7197

Length of English: 9614
Length of non-English: 45
Total length: 9659
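
The cutoff of three non-ASCII characters is a judgment call, so it may be worth exposing it as a parameter to see how the split changes. The sketch below assumes a hypothetical name, `is_probably_english`, and otherwise applies the same rule as `english_checker`.

```python
def is_probably_english(name, max_non_ascii=3):
    """True when `name` has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii


print(is_probably_english('Docs To Go™ Free Office Suite'))
print(is_probably_english('Instachat 😜'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
# A stricter cutoff rejects names containing any non-ASCII character.
print(is_probably_english('Instachat 😜', max_non_ascii=0))
```

Keep in mind that this is a character-based heuristic: a name written entirely in ASCII letters passes regardless of the language it is in.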



## Clean Out non-Free Apps

Loop through the datasets, then identify and remove any non-free apps.  
Prices are stored as strings, so either convert them to `float` before comparing (as with iOS) or compare against the string `'0'` (as with Android).

In [28]:
ios_free = []
ios_paid = []

for app in ios_english:
    price = float(app[4])
    if price == 0.0:
        ios_free.append(app)
    else:
        ios_paid.append(app)
        
print(f'Free iOS apps: {len(ios_free)}')
print(f'Paid iOS apps: {len(ios_paid)}')
print()


droid_free = []
droid_paid = []

for app in droid_english:
    price = app[7]
    if price == '0':
        droid_free.append(app)
    else:
        droid_paid.append(app)

print(f'Free Android apps: {len(droid_free)}')
print(f'Paid Android apps: {len(droid_paid)}')
print('\n')

# explore_data(ios_free, 0, 3, True)
# print('\n')
# explore_data(droid_free, 0, 3, True)

Free iOS apps: 3222
Paid iOS apps: 2961

Free Android apps: 8864
Paid Android apps: 750
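
Comparing against the exact string `'0'` works only if every free app is written that way. A slightly more defensive sketch converts the price to a number first. The helper names `parse_price` and `is_free` are hypothetical, and the `'$4.99'` format for paid apps is an assumption that does not appear in the output shown above.

```python
def parse_price(raw):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    # Assumption: paid prices may carry a leading dollar sign.
    return float(raw.strip().lstrip("$"))


def is_free(raw):
    """True when the parsed price is zero."""
    return parse_price(raw) == 0.0


for raw in ["0", "0.0", "$4.99"]:
    print(raw, parse_price(raw), is_free(raw))
```

The same `is_free` check could then be applied to both `app[4]` (iOS) and `app[7]` (Android), so the two loops share one rule.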





## App Profile Strategy

In order to minimize our risk and costs to market we'll use three steps:
1. Build a minimal version of an Android app.
2. If the app has a good response we can develop it further.
3. If it is profitable after 6 months, develop an iOS version for the App Store.

For our app dev purposes we want to find something that is popular on both the App Store and Google Play. We'll look at genre occurrence frequency to identify which genres are the most popular.  
Let's make a genre frequency table to start.

In [37]:
genre_frequency_ios = {}

for app in ios_free:
    genre = app[11]
    if genre in genre_frequency_ios:
        genre_frequency_ios[genre] += 1
    else:
        genre_frequency_ios[genre] = 1

genre_sorted = sorted(genre_frequency_ios.items(), key=lambda x: x[1], reverse=True)

for i in genre_sorted:
    print(f'{i[0]}: {i[1]}')

Games: 1874
Entertainment: 254
Photo & Video: 160
Education: 118
Social Networking: 106
Shopping: 84
Utilities: 81
Sports: 69
Music: 66
Health & Fitness: 65
Productivity: 56
Lifestyle: 51
News: 43
Travel: 40
Finance: 36
Weather: 28
Food & Drink: 26
Reference: 18
Business: 17
Book: 14
Navigation: 6
Medical: 6
Catalogs: 4
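
The count-and-sort pattern above is repeated for each column, so it could be factored into one function. As a sketch, the standard library's `collections.Counter` handles both the counting and the sorting. The name `freq_table` and the sample rows are made up for illustration.

```python
from collections import Counter


def freq_table(rows, index):
    """Return (value, count) pairs for one column, most frequent first."""
    return Counter(row[index] for row in rows).most_common()


# Made-up sample rows: app name, genre.
sample_apps = [
    ["App A", "Games"],
    ["App B", "Music"],
    ["App C", "Games"],
]

for genre, count in freq_table(sample_apps, 1):
    print(f"{genre}: {count}")
```

On the real data, `freq_table(ios_free, 11)` should yield the same counts as the table above.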


It looks like Games dominates. Not much of a surprise there. Entertainment could also be an avenue worth exploring.  
Let's check the Google Play info. There is a `Category` column as well as a `Genres` column. Let's run them separately and see how much similarity there is.

In [35]:
category_frequency_droid = {}
genre_frequency_droid = {}

for app in droid_free:
    category = app[1]
    if category in category_frequency_droid:
        category_frequency_droid[category] += 1
    else:
        category_frequency_droid[category] = 1
        
for app in droid_free:
    genre = app[9]
    if genre in genre_frequency_droid:
        genre_frequency_droid[genre] += 1
    else:
        genre_frequency_droid[genre] = 1
        
sort_cat_freq_droid = sorted(category_frequency_droid.items(), key=lambda x: x[1], reverse=True)
sort_genre_freq_droid = sorted(genre_frequency_droid.items(), key=lambda x: x[1], reverse=True)

for i in sort_cat_freq_droid[:10]:
    print(f'{i[0]}: {i[1]}')
print('\n')
for i in sort_genre_freq_droid[:10]:
    print(f'{i[0]}: {i[1]}')   

FAMILY: 1676
GAME: 862
TOOLS: 750
BUSINESS: 407
LIFESTYLE: 346
PRODUCTIVITY: 345
FINANCE: 328
MEDICAL: 313
SPORTS: 301
PERSONALIZATION: 294


Tools: 749
Entertainment: 538
Education: 474
Business: 407
Lifestyle: 345
Productivity: 345
Finance: 328
Medical: 313
Sports: 307
Personalization: 294


