# Analysis of Mobile Data
The aim of this project is to analyse data to help app developers understand what type of apps are likely to attract more users. 

This is valuable in app development because many apps are free to download, so their main source of revenue is in-app advertising. Ad revenue depends strongly on the number of users.



## Importing Data

In [32]:
from csv import reader

# Apple App Store data (utf8 is specified because app names contain non-ASCII characters)
open_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(open_file)
apple_apps_data = list(read_file)
open_file.close()
apple_header = apple_apps_data[0]
apple_data = apple_apps_data[1:]

# Google Play data
open_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(open_file)
google_apps_data = list(read_file)
open_file.close()
google_header = google_apps_data[0]
google_data = google_apps_data[1:]

In [33]:
def explore_data(dataset, start, end, rows_and_columns=False):
    """Print the rows in dataset[start:end] and, optionally, the data set's dimensions."""
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n')  # blank lines between rows
    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

In [34]:
print(apple_header)
print('\n')
explore_data(apple_data, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  7197
Number of columns:  16


The Apple data set has 7197 apps and 16 columns. Columns that may be useful are `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver` and `prime_genre`. For more information on the column meanings, see the data set documentation.

In [35]:
print(google_header)
print('\n')
explore_data(google_data, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  10841
Number of columns:  13


We see that the Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for the purpose of our analysis are `App`, `Category`, `Reviews`, `Installs`, `Type`, `Price`, and `Genres`.

## Data Cleaning

Before beginning our analysis, we need to make sure the data we analyze is accurate, or the results of our analysis will be wrong. This means that we need to do the following:

* Detect inaccurate data, and correct or remove it.
* Detect duplicate data, and remove the duplicates.

Recall that at our company, we only build apps that are free to download and install, and we design them for an English-speaking audience. This means that we'll need to do the following:

* Remove non-English apps like 爱奇艺PPS -《欢乐颂2》电视剧热播.
* Remove apps that aren't free.


The row below is missing its `Category` value, so its remaining values are shifted one column to the left and the `Rating` column reads 19.

In [39]:
print(google_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
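
A row like this can also be found programmatically rather than by inspection. The following is a minimal sketch, assuming that a malformed row is one whose column count differs from the header's. The helper name `find_malformed_rows` and the sample rows are hypothetical and are not part of the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Tiny illustrative sample (not the real data set).
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # the Category value is missing
    ['App C', 'GAME', '3.8'],
]

for index, row in find_malformed_rows(sample_header, sample_rows):
    print(index, row)
```

Applied to the real data, the call would be `find_malformed_rows(google_header, google_data)`. This check only catches rows with a missing or extra column, not rows with wrong values in the right number of columns.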


In [41]:
del google_data[10472]

Now that the erroneous row has been removed, the data set is one row shorter.

In [42]:
print(len(google_data))

10840


## Removing Duplicate Entries

From some exploration, we see there are four entries for 'Instagram' in the Google apps data set.

In [45]:
for row in google_data:
    if row[0] == 'Instagram':
        print(row)
        print('\n')

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




We suspect there are more duplicates, so let's check.

In [46]:
duplicate_apps = []
unique_apps = []

for row in google_data:
    app = row[0]
    if app in unique_apps:
        duplicate_apps.append(app)
    else:
        unique_apps.append(app)
        
print('Number of unique apps: ', len(unique_apps))
print('Number of duplicate apps: ', len(duplicate_apps))

Number of unique apps:  9659
Number of duplicate apps:  1181
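
The same counts can be obtained without repeated list lookups, which become slow as the list of names grows. The following is a hedged sketch using `collections.Counter` from the standard library. The function name `count_duplicates` and the sample rows are illustrative assumptions, not part of the original analysis.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (number of unique names, number of repeated entries)."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    # Every occurrence beyond the first one for a name counts as a duplicate.
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates


# Illustrative sample: one name appears three times, another twice.
sample_rows = [['Instagram'], ['Box'], ['Instagram'],
               ['Instagram'], ['Box'], ['Zoom']]
print(count_duplicates(sample_rows))
```

The duplicate count follows the same definition as the loop above: each repeat of an already seen name adds one.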


In the Google data set alone, there are 1181 duplicate entries. Some examples:

In [47]:
print(duplicate_apps[0:5])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


For duplicate entries, the only difference is the total number of reviews. To keep the most up-to-date record, we keep only the entry with the highest number of reviews for each app and delete the rest.

Here I loop through the data, adding each app name to a dictionary if it has not been seen before. I then update the stored value whenever a later row for the same app has a higher number of reviews.

In [55]:
reviews_max = {}

for row in google_data:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] =  n_reviews
        

In [57]:
print(len(reviews_max))

9659


Now we have a dictionary mapping each unique app name to its highest number of reviews.

We create two lists: `google_clean` for the cleaned rows and `already_added` for the names already kept. Looping through the data, if the app name has not been added yet and the row's number of reviews equals the maximum for that app, we add the row to the list of clean data.

The list of clean data should have 9659 rows, the same length as the dictionary created above.

In [60]:
google_clean = []
already_added = []

for row in google_data:
    name = row[0]
    n_reviews = float(row[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        google_clean.append(row)
        already_added.append(name)
        
print(google_clean[:5])
print(len(google_clean))

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]
9659
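
As an alternative, the two passes above can be combined into one by storing the best row for each name directly in a dictionary. This is only a sketch under stated assumptions: the function name `keep_most_reviewed` is hypothetical, the 'Box' row in the sample is made up for illustration, and the output order follows each name's first appearance, which may differ from the order produced above.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # A strict comparison keeps the earlier row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Illustrative sample; the 'Box' values are invented.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

for row in keep_most_reviewed(sample_rows):
    print(row)
```

Dictionary lookups also avoid the linear search that `name not in already_added` performs on a list.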


**Removing non-English apps**

I will create a function that checks the code point of each character in an app name using `ord()`. Characters commonly used in English text fall in the ASCII range (0 to 127), so a character with a code point above 127 is treated as non-English.

The function returns `False` if the app name contains any such character.

In [101]:
def is_english(string_test):
    for character in string_test:
        if ord(character) > 127:
            return False
    return True


The function is useful, but it incorrectly identifies the apps "Docs To Go™ Free Office Suite" and "Instachat 😜" as non-English, because the ™ symbol and the emoji fall outside the ASCII range.

In [102]:

print(is_english('Docs To Go™ Free Office Suite'))

print(is_english('Instachat 😜'))

False
False


One solution is to remove only names that contain more than three non-ASCII characters.

In [103]:
def is_english(string_test):
    count = 0
    for character in string_test:
        if ord(character) > 127:
            count += 1
        if count > 3:
            return False
    return True


Both test cases are now classified correctly.

In [104]:

print(is_english('Docs To Go™ Free Office Suite'))

print(is_english('Instachat 😜'))

True
True
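
The threshold of three is a judgment call, so it can help to make it a parameter. The following is a minimal sketch of the same heuristic. The name `is_mostly_ascii` and the `max_non_ascii` parameter are assumptions introduced here for illustration, chosen so that the notebook's `is_english` is not overwritten.

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_mostly_ascii('Docs To Go™ Free Office Suite'))
print(is_mostly_ascii('Instachat 😜'))
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
# With a threshold of zero, the check is as strict as the first version.
print(is_mostly_ascii('Instachat 😜', max_non_ascii=0))
```

This remains a heuristic: a short non-English name with three or fewer non-ASCII characters still passes, and an English name with many emoji is still rejected.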


In [108]:
google_english = []

for row in google_clean:
    if is_english(row[0]):
        google_english.append(row)
        
        
apple_english = [] 

for row in apple_data:
    if is_english(row[1]):
        apple_english.append(row)

Number of apps before and after removing non-English apps from the Google data:

In [114]:
print(len(google_clean), len(google_english))

9659 9614


Number of apps before and after removing non-English apps from the Apple data:

In [112]:
print(len(apple_data), len(apple_english))


7197 6183


**Isolating Free Apps**

Our analysis only considers free apps, so we keep only the rows whose price is zero.

In [144]:
google_free = []
for row in google_english:
    price = row[7]
    if price == '0':
        google_free.append(row)
        
        
apple_free = []
for row in apple_english:
    price = row[4]
    if price == '0.0':
        apple_free.append(row)
    

There are fewer rows in each data set after removing the non-free apps.

In [145]:
print(len(apple_free))

3222


In [146]:
print(len(google_free))

8864
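
The two loops above compare the price against a different string for each store ('0' and '0.0'). A more general check could parse the price as a number first. This is a hedged sketch, assuming that paid prices may carry a leading `$` sign, which the rows shown in this notebook do not confirm. The helper name `is_free` is hypothetical.

```python
def is_free(price_text):
    """Return True if price_text parses to zero; unparseable values count as not free."""
    cleaned = price_text.strip().lstrip('$')  # assumption: an optional leading '$'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False


for text in ['0', '0.0', '$4.99', 'Everyone']:
    print(text, is_free(text))
```

With such a helper, the same condition (`is_free(row[7])` for Google, `is_free(row[4])` for Apple) would work for both data sets.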
