### What this project is about:
This project analyses app details from Google Play and the Apple App Store to find out which kinds of apps gain the most traction.

Sources: 
A data set containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. [https://www.kaggle.com/lava18/google-play-store-apps]

A data set containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017. [https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps]

### Learnings and my code repository pieces

I am starting from fairly basic skills, for practice. I will keep adding better methods for these tasks as I discover them.

1. Accessing data files in another location
2. Reading csv files in python
3. Finding duplicate entries for a column in a list of lists 
4. List comprehension as a filter to choose rows
5. Dicts used to build unique keys and histograms
6. Burying hyperlinks! [Clickable_visible_hyperlink](Hidden_landing_URL)

#### Reading files into python lists

Reading files with Python: the csv reader used here gives plain lists of lists, whereas pandas read_csv() returns a DataFrame.

In [1]:
#Reading files from python
import os
from csv import reader

## The folder is stored as an absolute path; it needs to be changed on other machines
path = 'C:\\Users\\btjos\\Documents\\Git_Data\\'

In [2]:
#Google Playstore 
opened_file = open(path + 'googleplaystore.csv', encoding='utf8')
android_file = reader(opened_file)
android = list(android_file)

#AppStore
ios_file = reader(open(path + 'AppleStore.csv', encoding='utf8'))
ios = list(ios_file)
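
The cells above leave the opened files unclosed. A minimal sketch of the same read wrapped in a `with` block, so the file is closed as soon as the rows are loaded. The helper name `read_rows` and the throwaway demo file are assumptions for illustration, not part of the original notebook.

```python
import csv
import os
import tempfile


def read_rows(csv_path):
    """Return every row of a CSV file as a list of lists of strings."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(csv_path, encoding='utf8', newline='') as f:
        return list(csv.reader(f))


# Self-contained demo on a temporary file (sample_rows is made-up data).
sample_rows = [['App', 'Category'], ['Box', 'BUSINESS']]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows(sample_rows)
    demo = read_rows(demo_path)

print(demo)
```

With the real data this would be called as `read_rows(path + 'googleplaystore.csv')`.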

In [3]:
print('Rows in android: ', len(android), '& columns: ', len(android[0]))
print('Rows in ios: ', len(ios), '& columns: ', len(ios[0]))
print(android[0])
print(android[1], '\n')
print(ios[0])
print(ios[1])
# Good descriptive statistics are shown on these data sites

Rows in android:  10842 & columns:  13
Rows in ios:  7198 & columns:  16
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


### Cleaning up data

The websites mentioned above give a good summary of the data. The clean-up needs to:
1. remove rows with known errors
2. remove duplicate rows
3. remove non-English and paid apps, as only free English apps are explored further

In [4]:
print(android[0], '\n')
print(android[10473])
# Documented on the site: the entry 'Life Made WI-Fi Touchscreen Photo Frame' has a wrong rating, so it is deleted.
del android[10473]
print(android[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


In [5]:
# 17 Jan 2020 - looping through lists
unique_apps = []
dup_apps = []
# for row in android:
#     app = row[0]
#     if app in unique_apps:
#         dup_apps.append(app)
#     else:
#         unique_apps.append(app)

# 18 Jan 2020 - can the if/else be fit in one line? Works!!
for row in android:
    app = row[0]
    dup_apps.append(app) if app in unique_apps else unique_apps.append(app)

print('Number of duplicate apps: ', len(dup_apps), '\n')
print('Some of duplicate apps:')
print(dup_apps[:20])

Number of duplicate apps:  1181 

Some of duplicate apps: 

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software', 'MailChimp - Email, Marketing Automation', 'Crew - Free Messaging and Scheduling', 'Asana: organize team projects', 'Google Analytics', 'AdWords Express']
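
The check `app in unique_apps` scans a list, which gets slower as the list grows. A minimal sketch of the same duplicate search with a set for the membership test. The function name `split_duplicates` and the toy rows are assumptions for illustration.

```python
def split_duplicates(rows, name_index=0):
    """Return (set of names seen, list of names that appeared again)."""
    seen = set()
    duplicates = []
    for row in rows:
        name = row[name_index]
        if name in seen:
            # every repeat is recorded, so a name seen 3 times is listed twice
            duplicates.append(name)
        else:
            seen.add(name)
    return seen, duplicates


# Toy rows standing in for the android list (made-up data).
toy_rows = [['Box'], ['Slack'], ['Box'], ['Zoom'], ['Box']]
unique_names, dup_names = split_duplicates(toy_rows)
print(len(unique_names), dup_names)
```

The counting rule is the same as in the loop above: each repeated occurrence adds one entry to the duplicates list.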


### Duplicate entries for apps in 'android' 

The duplicate rows differ in the number of reviews, which was collected at different points in time. Only the row with the maximum number of reviews is retained for further analysis.

A dictionary is used to store app names and their review counts. While cycling through each row:
1. if the app is already added, the current number of reviews is compared with the stored one. If the current record has a higher number of reviews, the value in the dictionary is updated.
2. if the app is not already added, the app name and its number of reviews are added to the dictionary
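
The two steps can also be sketched in a single pass by storing the whole row instead of only the review count. The function name `keep_most_reviewed` and the toy rows are assumptions for illustration. Rows come back in the order each app name first appears, which can differ from the row order produced by a two-pass approach.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for every app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # a new name is stored; a known name is replaced only by a strictly higher count
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Toy rows with the review count in column 3 (made-up data).
toy = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Slack', 'BUSINESS', '4.4', '50'],
    ['Box', 'BUSINESS', '4.2', '150'],
]
cleaned = keep_most_reviewed(toy)
print(cleaned)
```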


In [6]:
# create a dictionary to collect rows to be retained
reviews_max = {}
for row in android[1:]:
    app_name = row[0]
    n_reviews = float(row[3])
    if (app_name in reviews_max) and (reviews_max[app_name] < n_reviews):
        reviews_max[app_name] = n_reviews
    if app_name not in reviews_max:
        reviews_max[app_name] = n_reviews
print(len(reviews_max))

android_clean = []
already_added = []
for row in android[1:]:
    app_name = row[0]
    n_reviews = float(row[3])
    if (reviews_max[app_name] == n_reviews) and (app_name not in already_added):
        android_clean.append(row)
        already_added.append(app_name)
print(len(android_clean))
already_added[0:5]

9659
9659


['Photo Editor & Candy Camera & Grid & ScrapBook',
 'U Launcher Lite – FREE Live Cool Themes, Hide Apps',
 'Sketch - Draw & Paint',
 'Pixel Draw - Number Art Coloring Book',
 'Paper flowers instructions']

### Removing non-english apps.

To remove non-English apps, a helper function is created that reports whether a string contains non-English characters. English text uses characters from the ASCII standard, each with a corresponding number between 0 and 127.

But some app names include characters such as ™ or emoji. To keep those apps, a name is excluded only if it has more than 3 non-ASCII characters. Some genuine English apps might still be excluded, but the impact should be minimal.

In [7]:
# helper function
def is_english(string):
    non_ascii = 0
    for character in string:
        if ord(character) > 127:
            non_ascii += 1

    if non_ascii > 3:
        return False
    else:
        return True
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
True


In [8]:
android_english = []
ios_english = []

for row in android_clean:
    if is_english(row[0]):
        android_english.append(row)
print('android_clean rows: ', len(android_clean))
print('android_english rows: ', len(android_english), '\n')
for row in ios:
    if is_english(row[1]):
        ios_english.append(row)
print('ios rows: ', len(ios))
print('ios_english rows: ', len(ios_english))

android_clean rows:  9659
android_english rows:  9614 

ios rows:  7198
ios_english rows:  6184


In [9]:
# How about a list comprehension? Works!
#### To do: explore dict comprehensions too
android_english = [row for row in android_clean if is_english(row[0])]
print('android_clean rows: ', len(android_clean))
print('android_english rows: ', len(android_english), '\n')

ios_english = [row for row in ios if is_english(row[1])]
print('ios rows: ', len(ios))
print('ios_english rows: ', len(ios_english))

android_clean rows:  9659
android_english rows:  9614 

ios rows:  7198
ios_english rows:  6184


### Removing paid apps



In [10]:
android_final = [row for row in android_english if row[7] == '0']
print('android_english rows: ', len(android_english))
print('android_final rows: ', len(android_final), '\n')

ios_final = [row for row in ios_english if row[4] == '0.0']
print('ios_english rows: ', len(ios_english))
print('ios_final rows: ', len(ios_final))

android_english rows:  9614
android_final rows:  8864 

ios_english rows:  6184
ios_final rows:  3222
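
The filters above compare price strings exactly (`'0'` for Android, `'0.0'` for iOS), so each data set needs its own literal. A minimal sketch of one numeric check that could serve both. The helper name `is_free` is hypothetical, and the `'$4.99'` format for paid prices is an assumption about the data.

```python
def is_free(price_text):
    """True if the price string represents zero, e.g. '0' or '0.0'."""
    # assumption: paid prices may carry a leading '$', as in '$4.99'
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # anything that is not a number is treated as "not free"
        return False


print(is_free('0'), is_free('0.0'), is_free('$4.99'))
```

It would then be used as `[row for row in android_english if is_free(row[7])]`.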


## What kind of apps attract more people?

Free apps earn revenue from ads, so their revenue depends on the number of people installing and reviewing the apps.

A good entry strategy could be:
1. Build a minimal version and add it to Google Play, which has the majority of users
2. Develop further if response is good
3. If successful, launch iOS version.

#### Begin with popular genre types

Android fields:
'App', [1]'Category', 'Rating', [3]'Reviews', 'Size', [5]'Installs', 'Type', 'Price', 'Content Rating', [9]'Genres', 'Last Updated', 'Current Ver', 'Android Ver'


iOS fields:
'id', 'track_name', 'size_bytes', 'currency', 'price', [5]'rating_count_tot', 'rating_count_ver', [7]'user_rating', 'user_rating_ver', 'ver', 'cont_rating', [11]'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'
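
Instead of remembering the index numbers marked above, a name-to-index lookup can be built from each header row with a dict comprehension. This is a minimal sketch; the variable names `android_col` and `ios_col` are assumptions for illustration.

```python
android_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

# map every column name to its position in the row
android_col = {name: i for i, name in enumerate(android_header)}
ios_col = {name: i for i, name in enumerate(ios_header)}

print(android_col['Genres'], ios_col['prime_genre'])
```

With the loaded data the headers would come from `android[0]` and `ios[0]`, and a row value can be read as `row[ios_col['prime_genre']]`.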

#### Helper functions generate frequency tables and then display sorted percentages



In [19]:
def freq_table(dataset, index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key]/total) * 100
        table_percentages[key] = percentage
    return table_percentages

# dicts are hard to display sorted, so the entries are put into a list as (value, key) tuples
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [22]:
display_table(ios_final, -5)  # index -5 is the same column as [11] 'prime_genre'

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665



In [23]:
### Explore and develop this further