# Practice - Exploring Datasets without Pandas

The following code is mostly based on exercises from Dataquest. The main purpose of this notebook is to get comfortable with exploring several datasets without using pandas.

#### Business Problem: The aim of this analysis is to help developers understand what type of apps are likely to attract more users on Google Play and the App Store

The following datasets are available online:
  - Apple: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps
  - Google: https://www.kaggle.com/lava18/google-play-store-apps

## Data Exploration

In [90]:
# open the 2 files and read each one into a list of rows
from csv import reader

# the with blocks close the files once the rows have been read
with open('AppleStore.csv', encoding='utf8', newline='') as opened_file_apple:
    apple_data = list(reader(opened_file_apple))
apple_header = apple_data[0]   # first row holds the column names
apple = apple_data[1:]         # remaining rows hold the apps

with open('googleplaystore.csv', encoding='utf8', newline='') as opened_file_google:
    google_data = list(reader(opened_file_google))
google_header = google_data[0]
google = google_data[1:]

In [91]:
# print a slice of a dataset and, optionally, its number of rows and columns
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

### Apple Data

In [101]:
explore_data(apple, 2, 3, True)

['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 7197
Number of columns: 17


In [100]:
#header
print(apple_header)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


### Google data

In [110]:
explore_data(google, 2, 3, True)

['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


In [111]:
# header google
print(google_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


### Summary

According to the row counts, the Google dataset has more data points than the Apple dataset (10,841 vs. 7,197 rows, excluding the header rows). Some columns will definitely contribute more than others to the purpose of this study, which is to find out which apps users prefer:

  - Apple: track_name, price, rating_count_tot, prime_genre
  - Google: App, Rating, Price, Category, Installs, Genres
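
Because the headers are plain lists, the position of each column of interest can be looked up by name rather than counted by hand. The following is a minimal sketch, assuming the two header lists printed above; `column_index` is a hypothetical helper and is not part of the original notebook.

```python
# Header lists copied from the outputs printed earlier in this notebook.
apple_header = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']

def column_index(header, name):
    """Return the position of a column name in a header list (hypothetical helper)."""
    return header.index(name)

# Map each column of interest to its index in a row.
apple_columns = {name: column_index(apple_header, name)
                 for name in ['track_name', 'price', 'rating_count_tot', 'prime_genre']}
google_columns = {name: column_index(google_header, name)
                  for name in ['App', 'Rating', 'Price', 'Category', 'Installs', 'Genres']}

print(apple_columns)
print(google_columns)
```

With these mappings, a value can be read as `row[apple_columns['price']]` instead of a hard-coded `row[5]`.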

## Data Cleaning

Steps to perform:

  - remove malformed rows
  - remove duplicated entries
  - remove non-free apps
  - remove apps whose names contain non-English characters
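
The step that removes non-free apps could look roughly like the sketch below. It assumes the price is stored as a string that may carry a leading `$`, and that the price sits at index 7 for Google rows (index 5 for Apple rows, per the headers above). `is_free` and `keep_free` are hypothetical helpers, and the paid row in the example is made up purely for illustration.

```python
def is_free(price):
    """Return True when a price string parses to zero.

    Assumption: prices look like '0', '0.0' or '$4.99'.
    """
    return float(price.lstrip('$')) == 0.0

def keep_free(dataset, price_index):
    """Return only the rows whose price column is zero."""
    return [row for row in dataset if is_free(row[price_index])]

sample_google = [
    # Real row taken from the exploration output above (price '0').
    ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN',
     '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone',
     'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'],
    # Hypothetical paid row, invented for this example only.
    ['Example Paid App', 'TOOLS', '4.0', '10', '1.0M', '100+', 'Paid',
     '$4.99', 'Everyone', 'Tools', 'January 1, 2018', '1.0', '4.0 and up'],
]

free_google = keep_free(sample_google, price_index=7)
print(len(free_google))
```

Running the price filter only after the malformed row has been deleted avoids feeding a shifted, non-numeric value to `float`.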

### Google

In [112]:
# find rows whose length differs from the header length
header_length = len(google_header)
for index, row in enumerate(google):
    if len(row) != header_length:
        print(row)
        print(index)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10472


In [113]:
# inspect the flagged row next to the header and a well-formed neighboring row
print(google_header)
print('\n')
print(google[10472])
print('\n')
print(google[10471])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


In [114]:
# row 10472 is missing its Category value, so its columns are shifted - remove it
print(len(google))
del google[10472]
print(len(google))

10841
10840


In [129]:
# find duplicated app names (nothing is removed yet)
duplicated_apps = []
unique_apps = []

for app in google:
    name = app[0]
    #print(name)
    if name in unique_apps:
        duplicated_apps.append(name)
    else:
        unique_apps.append(name)

print('Duplicated:',duplicated_apps[:10])
print('\n')
print('Duplicated:',len(duplicated_apps))
print('Unique:',len(unique_apps))

Duplicated: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


Duplicated: 1181
Unique: 9659


In [141]:
for app in google:
    name = app[0]
    if name == 'Quick PDF Scanner + OCR FREE':
        print(app)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


### Analysis


In the example above, the duplicated rows differ only in the fourth column (index 3), Reviews, which holds the number of reviews given to the app. Rather than dropping duplicates at random, the best approach is to keep, for each app, the row with the highest review count, since that row is most likely the most recent one.
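
The rule described above can be sketched in two passes: first record the highest review count per app name, then keep the first row that reaches it. This is a minimal sketch, not necessarily how the notebook goes on to do it. `remove_duplicates` is a hypothetical name, and the code assumes the Reviews column holds integer strings, which should be the case once the malformed row has been deleted.

```python
def remove_duplicates(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    # First pass: record the highest review count seen for each name.
    reviews_max = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = int(row[reviews_index])  # assumes integer strings
        if name not in reviews_max or n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews

    # Second pass: keep the first row matching that maximum, skip the rest.
    cleaned = []
    already_added = set()
    for row in dataset:
        name = row[name_index]
        if int(row[reviews_index]) == reviews_max[name] and name not in already_added:
            cleaned.append(row)
            already_added.add(name)
    return cleaned

# Rows copied from the outputs above.
sample = [
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up'],
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up'],
    ['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up'],
    ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up'],
]

cleaned = remove_duplicates(sample)
for row in cleaned:
    print(row[0], row[3])
```

The `already_added` set matters because two rows can tie on the maximum, as the first two rows of the sample do. Applied to the full Google dataset, this should leave 9,659 rows, matching the number of unique names counted above.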

### Apple

In [106]:
# find rows whose length differs from the header length
header_length = len(apple_header)
for index, row in enumerate(apple):
    if len(row) != header_length:
        print(row)
        print(index)

In [108]:
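# the loop above printed nothing, so every Apple row matches the header length;
# row still holds the last row visited by the loop
print(row)
print(apple.index(row))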

['11097', '1188375727', 'Escape the Sweet Shop Series', '90898432', 'USD', '0', '3', '3', '5', '5', '1.0', '4+', 'Games', '40', '0', '2', '1']
7196


In [68]:
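import re

def is_english(string):
    """Return False if the string contains Korean, Japanese or Chinese characters."""
    # korean (Hangul syllables)
    if re.search("[\uac00-\ud7a3]", string):
        return False
    # japanese (Hiragana and Katakana)
    if re.search("[\u3040-\u30ff]", string):
        return False
    # chinese (CJK unified ideographs)
    if re.search("[\u4e00-\u9fff]", string):
        return False
    # none of the ranges matched: treat the string as English
    return True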
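
A language check of this kind can then be used to filter a dataset by app name. The sketch below is self-contained and assumes the same three Unicode ranges as the cell above, combined into one pattern, with an explicit `True` result when nothing matches. `looks_english` and `keep_english` are hypothetical names, and the non-English app names in the sample are made up for illustration. Note that this check only screens out Korean, Japanese and Chinese characters; names in other scripts pass through.

```python
import re

# Hangul syllables, Hiragana/Katakana, and CJK unified ideographs.
CJK_PATTERN = re.compile("[\uac00-\ud7a3\u3040-\u30ff\u4e00-\u9fff]")

def looks_english(string):
    """Return True when the string contains none of the CJK ranges above."""
    return CJK_PATTERN.search(string) is None

def keep_english(dataset, name_index):
    """Return only the rows whose name column passes the language check."""
    return [row for row in dataset if looks_english(row[name_index])]

# Shortened rows: only the name column is filled in for this example.
sample_rows = [
    ['Box'],            # name taken from the duplicates output above
    ['날씨'],            # hypothetical Korean name
    ['てんき'],          # hypothetical Japanese name
    ['天気 Weather'],    # hypothetical name mixing ideographs and English
    ['Slack'],          # name taken from the duplicates output above
]

english_rows = keep_english(sample_rows, name_index=0)
print([row[0] for row in english_rows])
```

For the real datasets, `name_index` would be 0 for Google (App) and 2 for Apple (track_name), following the headers printed earlier.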
    

In [80]:
google_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']