# Free Mobile Apps Analysis
Our goal for this project is to analyze app store data to help our developers understand what types of apps are likely to attract more users. Our apps are free, so revenue comes mostly from in-app ads: the more users who see and engage with the ads, the better.

My personal goal is to practice Python and to use this project to start my portfolio.

In [2]:
# Read the Apple App Store apps file and store its rows in a list (ds_apple)
from csv import reader

# The with block closes the file once its rows have been read
with open('AppleStore.csv', encoding='utf8') as file_apple:
    rd_apple = reader(file_apple)
    ds_apple = list(rd_apple)

In [27]:
# Read the Google Play apps file and store its rows in a list (ds_google)
# The with block closes the file once its rows have been read
with open('googleplaystore.csv', encoding='utf8') as file_google:
    rd_google = reader(file_google)
    ds_google = list(rd_google)

In [4]:
# Function to display rows in a more readable format
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [5]:
# displaying the first 3 rows from Google
explore_data(ds_google, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [6]:
# displaying the first 3 rows from Apple
explore_data(ds_apple, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


Google columns selected for our analysis:
    * App, Category, Rating, Reviews, Size, Installs
    * Type, Price, Content Rating, Genres, Last Updated

Apple columns selected for our analysis:
    * track_name, size_bytes, price, rating_count_tot
    * rating_count_ver, user_rating, cont_rating, prime_genre
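
The column selection above could be applied by looking up each name in the header row. This is only a minimal sketch: `select_columns` is a hypothetical helper that is not part of the notebook, and `sample` is a small stand-in for `ds_apple` that reuses a few values from the output above.

```python
def select_columns(dataset, wanted):
    """Return a copy of dataset (header row first) keeping only the wanted columns."""
    header = dataset[0]
    # Look up the position of each wanted column in the header row.
    indexes = [header.index(name) for name in wanted]
    return [[row[i] for i in indexes] for row in dataset]


# Toy rows in the same shape as the Apple data set (header first).
sample = [
    ['id', 'track_name', 'price', 'prime_genre'],
    ['284882215', 'Facebook', '0.0', 'Social Networking'],
    ['389801252', 'Instagram', '0.0', 'Photo & Video'],
]

selected = select_columns(sample, ['track_name', 'prime_genre'])
for row in selected:
    print(row)
```

Looking columns up by name rather than by position keeps the selection readable, and `header.index` raises a `ValueError` if a name is misspelled.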


In [30]:
# Checking a known problem: one row is missing its second column (Category).
# The problem was reported in this Kaggle discussion thread:
# https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015
print(ds_google[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [31]:
# Loop to detect rows with missing columns by comparing each row's length with the header's length.
header_length = len(ds_google[0])
for index, row in enumerate(ds_google):
    if len(row) != header_length:
        print(row)
        print(index)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10473


In [32]:
# Removing the problematic row (another approach would be to fill in the missing value).
# Run this cell only once: running it again would delete a valid row.
del ds_google[10473]

In [33]:
# Confirming the row was deleted: index 10473 now holds the next app
print(ds_google[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


In [43]:
# Checking for duplicate rows (Google apps that appear more than once under the same name)
def duplicate_apps(data, vendor='google'):
    dup_apps = []
    uni_apps = set()  # a set keeps the membership test fast

    if vendor == 'google':
        app_col = 0
    elif vendor == 'apple':
        app_col = 1
    else:
        print('Error, unknown vendor.')
        return

    for app in data:
        name = app[app_col]
        if name in uni_apps:
            dup_apps.append(name)
        else:
            uni_apps.add(name)
    
    print('Duplicate apps count: ', len(dup_apps))
    print('\n')
    print('# Example of duplicate apps:')
    for x in dup_apps[:10]:
        print(x)



In [44]:
duplicate_apps(ds_google)

Duplicate apps count:  1181


# Example of duplicate apps:
Quick PDF Scanner + OCR FREE
Box
Google My Business
ZOOM Cloud Meetings
join.me - Simple Meetings
Box
Zenefits
Google Ads
Google My Business
Slack


In [49]:
def get_app_by_name(name, data, vendor='google'):
    # print the rows from the apps list whose name matches the given app name
    if vendor == 'google':
        app_col = 0
    elif vendor == 'apple':
        app_col = 1
    else:  
        print('Error, unknown vendor.')
        return
    
    for row in data:
        app = row[app_col]
        if name == app:
            print(row)
            

In [50]:
get_app_by_name('Slack', ds_google, 'google')

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


In the duplicate entries above, only the 4th column (Reviews, the number of reviews) differs.
We'll treat the row with the highest review count as the most recent one and remove the other rows.

In [59]:
# storing the max number of reviews for each app in a dictionary
reviews_max = {}
for row in ds_google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

print(len(reviews_max))

    

9659


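For each Google app, we check whether the row has that app's maximum number of reviews.
If it does, we treat it as the most recent row for the app and add it to a clean list of apps. Because several rows can share the maximum, we also record which names were already added so that no duplicates slip in.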

In [60]:
# for each app, keep only the first row that has the max number of reviews
google_clean = []
already_added = set()  # names already copied to the clean list
for row in ds_google[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        google_clean.append(row)
        already_added.add(name)
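
As an alternative to the two passes above, the same rule (keep the row with the most reviews, first one wins on a tie) can be sketched in a single pass by storing the best row per name in a dictionary. This is not the notebook's method: `dedupe_by_max_reviews` is a hypothetical helper, and the sample rows are shortened toy data; the 'Example App' row is made up for illustration.

```python
def dedupe_by_max_reviews(rows, name_col=0, reviews_col=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> best row seen so far (dicts keep insertion order)
    for row in rows:
        name = row[name_col]
        # Strictly greater, so the earliest row wins when counts are tied.
        if name not in best or float(row[reviews_col]) > float(best[name][reviews_col]):
            best[name] = row
    return list(best.values())


# Shortened toy rows (no header): name, category, rating, reviews.
sample = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Example App', 'TOOLS', '4.0', '120'],  # made-up row
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]

clean = dedupe_by_max_reviews(sample)
for row in clean:
    print(row)
```

One difference to keep in mind: this version returns apps in the order their names first appear, while the two-pass approach orders them by where the max-review row sits, so the two results can contain the same rows in a different order.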
    

In [62]:
# Checking the clean list from google apps:
explore_data(google_clean, 0, 5, True)        


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


Number of rows: 9659
Number of columns: 13
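
As a quick cross-check, the 9659 rows can be reproduced from the counts printed earlier: 10842 rows in the raw file, minus the header, minus the one row deleted for its missing column, minus the 1181 duplicates. The helper name `expected_unique_rows` below is hypothetical; the numbers are the ones shown in the outputs above.

```python
def expected_unique_rows(total_rows, header_rows, removed_rows, duplicate_rows):
    """Rows expected after dropping the header, removed rows, and duplicates."""
    return total_rows - header_rows - removed_rows - duplicate_rows


expected = expected_unique_rows(
    total_rows=10842,     # rows reported by explore_data, header included
    header_rows=1,
    removed_rows=1,       # the row at index 10473 with the missing column
    duplicate_rows=1181,  # count reported by duplicate_apps
)
print(expected)  # 9659
```

The result matches both `len(reviews_max)` and the number of rows in `google_clean`, which suggests the clean-up removed exactly the duplicates and nothing else.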
