# iOS and Android App Research
* I am putting together data sets to better understand app-development statistics. We'll be looking at data collected from Google Play and the App Store.

For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

In [1]:
from csv import reader

# App Store data set
# An explicit encoding keeps the read independent of the platform default,
# and the `with` block closes the file once it has been read.
with open('AppleStore.csv', encoding='utf8') as opened_file_ios:
    ios_all_data = list(reader(opened_file_ios))
ios_header = ios_all_data[0]
ios = ios_all_data[1:]

# Google Play data set
with open('googleplaystore.csv', encoding='utf8') as opened_file_android:
    droid_all_data = list(reader(opened_file_android))
droid_header = droid_all_data[0]
droid = droid_all_data[1:]

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    """Print rows dataset[start:end]; optionally print the data set's dimensions first."""
    dataset_slice = dataset[start:end]

    if rows_and_columns:
        print(f'Number of rows: {len(dataset)}')
        print(f'Number of columns:{len(dataset[0])}\n')

    for row in dataset_slice:
        print(row)
        print('\n')  # leaves two blank lines after each row

To find free, user-driven apps funded by ad revenue, I believe the relevant columns will be:
* name
* price
* user ratings 
* prime genre
* category
* reviews
* genre
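
As a rough sketch of how the columns listed above could be located by name rather than by hard-coded position, the snippet below looks each one up in the header rows. The header lists are copied from the notebook output; the helper `column_indices` and the mapping from the informal names above to actual column names (for example "user ratings" to `rating_count_tot`) are assumptions made for illustration.

```python
# Header rows copied from the notebook output.
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
droid_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                'Current Ver', 'Android Ver']


def column_indices(header, wanted):
    """Map each wanted column name to its position in the header row.

    Hypothetical helper; raises ValueError if a name is not in the header.
    """
    return {name: header.index(name) for name in wanted}


# Assumed reading of the informal column list.
ios_cols = column_indices(
    ios_header, ['track_name', 'price', 'rating_count_tot', 'prime_genre'])
droid_cols = column_indices(
    droid_header, ['App', 'Price', 'Category', 'Reviews', 'Genres'])

print(ios_cols)
print(droid_cols)
```

Looking columns up by name keeps later cells readable (`row[droid_cols['Reviews']]` instead of `row[3]`) and fails loudly if a header is misspelled.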

In [3]:
print('iOS Header...')
print(ios_header, '\n')
print('iOS Data...')
explore_data(ios, 1, 3, True)

print('Android Header...')
print(droid_header, '\n')
print('Android Data...')
explore_data(droid, 1, 3, True)

iOS Header...
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

iOS Data...
Number of rows: 7197
Number of columns:16

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Android Header...
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

Android Data...
Number of rows: 10841
Number of columns:13

['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Laun

**Make sure every row has the same length as the `header` row.**

In [4]:
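# enumerate() gives the position directly; droid.index(row) would rescan
# the list and return the first equal row, which may not be this one.
for index, row in enumerate(droid):
    if len(row) != len(droid_header):
        print(row)
        print('\n')
        print(f"Index position is {index}")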

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


Index position is 10472


In [5]:
# Print the `row` in an easy to read fashion.
problem_row = droid[10472]
for item_head, item_row in zip(droid_header, problem_row):
    print(f"{item_head}: {item_row}")

App: Life Made WI-Fi Touchscreen Photo Frame
Category: 1.9
Rating: 19
Reviews: 3.0M
Size: 1,000+
Installs: Free
Type: 0
Price: Everyone
Content Rating: 
Genres: February 11, 2018
Last Updated: 1.0.19
Current Ver: 4.0 and up


**This app appears to have a `Rating` of 19, which is not possible because ratings top out at 5. 
According to the discussion board, the row is missing its `Category` entry, so every later value is shifted one column to the left. 
We'll just delete the whole entry (`row`) for now.**

In [6]:
del droid[10472]
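
Deleting by a hard-coded index removes a different, valid row if the cell is run a second time. A possible alternative, sketched here on stand-in data, is to keep only the rows whose length matches the header. The helper `drop_malformed_rows` and the sample rows are hypothetical and are not part of the original notebook.

```python
def drop_malformed_rows(rows, header):
    """Return only the rows that have as many fields as the header."""
    return [row for row in rows if len(row) == len(header)]


# Tiny stand-in data (hypothetical), shaped like the Android data set.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['Good App', 'TOOLS', '4.2'],
    ['Shifted App', '1.9'],  # one field short, like the problem row
]

cleaned = drop_malformed_rows(sample_rows, sample_header)
print(len(cleaned))
```

Because the filter builds a new list instead of mutating in place, running it again on its own result changes nothing.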

In [7]:
# Confirm that index 10472 now holds a well-formed row.
problem_row = droid[10472]
for item_head, item_row in zip(droid_header, problem_row):
    print(f"{item_head}: {item_row}")

App: osmino Wi-Fi: free WiFi
Category: TOOLS
Rating: 4.2
Reviews: 134203
Size: 4.1M
Installs: 10,000,000+
Type: Free
Price: 0
Content Rating: Everyone
Genres: Tools
Last Updated: August 7, 2018
Current Ver: 6.06.14
Android Ver: 4.4 and up


Recall that at our company, we only build apps that are free to download and install, and we design them for an English-speaking audience. This means that we'll need to do the following:

* Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
* Remove apps that aren't free.
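
The two filters above could be sketched roughly as follows. Both helpers are hypothetical. The English check is only a heuristic: it assumes a name is English if it contains at most three characters outside the ASCII range, so names with an emoji or a trademark sign still pass. The price check assumes free apps are stored as `'0'` or `'0.0'` and that paid Android prices carry a leading `$`.

```python
def is_english(name, max_non_ascii=3):
    """Heuristic: treat a name as English if few characters are non-ASCII.

    The threshold of 3 is an assumption, chosen so that names containing
    an emoji or a trademark sign are not thrown away.
    """
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii


def is_free(price):
    """Return True when a price string represents zero.

    Assumes prices look like '0', '0.0', or '$4.99'.
    """
    return float(price.lstrip('$')) == 0.0


print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_free('0.0'), is_free('$4.99'))
```

A row would be kept only when both checks pass for its name and price columns.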

## Let's investigate duplicate apps.

In [8]:
duplicate_apps = []
unique_apps = []

for app in droid:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print(f'Number of duplicate apps: {len(duplicate_apps)}\n')
print(f'Examples of duplicate apps: {duplicate_apps[:15]}')

Number of duplicate apps: 1181

Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


**That's 1181 duplicate entries. Let's see if we can find some discrepancies between the entries.**

In [9]:
for app in droid:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


**`Instagram` has multiple entries that differ only in the `Reviews` count (the `Rating` is 4.5 in every one). It's reasonable to assume that the higher the `Reviews` count, the more recent the data. 
Instead of removing duplicates randomly, we'll keep the entry with the highest `Reviews` count for each app.**

In [12]:
print(f'Expected length: {len(droid) - 1181}')

Expected length: 9659
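
One way to carry out the deduplication described above is a two-pass approach: first record the highest `Reviews` count seen for each app name, then keep exactly one row per app that matches that count. The function `remove_duplicates` is a hypothetical sketch, not the notebook's own code. It assumes the name sits at index 0 and the review count at index 3, as in `droid_header`, and that every review count parses as an integer. The sample rows are the `Instagram` entries from the output above, trimmed to four columns, plus one made-up row.

```python
def remove_duplicates(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the highest review count."""
    # Pass 1: highest review count seen for each name.
    reviews_max = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = int(row[reviews_idx])
        if name not in reviews_max or n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews

    # Pass 2: keep the first row that matches the maximum for its name.
    # The `already_added` set guards against ties in the review count.
    cleaned = []
    already_added = set()
    for row in rows:
        name = row[name_idx]
        n_reviews = int(row[reviews_idx])
        if n_reviews == reviews_max[name] and name not in already_added:
            cleaned.append(row)
            already_added.add(name)
    return cleaned


sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Example App', 'TOOLS', '4.0', '10'],  # made-up row for illustration
]

deduplicated = remove_duplicates(sample)
for row in deduplicated:
    print(row)
```

If applied to `droid`, the length of the result can be compared with the expected length computed above as a sanity check.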
