# Analysis of free apps in the Google Play store and App Store

* This project analyzes data on free Android (Google Play) and Apple (App Store) apps.
* The goal is to help developers understand which types of apps are likely to attract more users and which are likely to generate the most advertising revenue.

In [1]:
from csv import reader

# Read each CSV file into a list of rows; the first row of each list is the header.
# The with-blocks close the files once they have been read.
with open('AppleStore.csv', encoding='utf8') as apple_file:
    apple_apps_data = list(reader(apple_file))

with open('googleplaystore.csv', encoding='utf8') as google_file:
    google_apps_data = list(reader(google_file))

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(apple_apps_data, 0, 4, rows_and_columns=True)
print('\n')
explore_data(google_apps_data, 0, 4, rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 

# Deleting inaccurate data


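One row in the Google Play Store dataset is missing information. The following code deletes that row, which sits at index 10473 of google_apps_data, and prints the length of the dataset before and after. Run this cell only once, because each run deletes whichever row is at that index.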

In [2]:
print(len(google_apps_data))
del google_apps_data[10473]
print(len(google_apps_data))

10842
10841
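The index used above is hard-coded. As a minimal sketch of how such a row could be located, the helper below (a hypothetical name, `find_malformed_rows`, not part of the original notebook) returns every row whose length differs from the header row. It runs on a small made-up sample rather than the real dataset.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected_length = len(dataset[0])
    return [
        (index, row)
        for index, row in enumerate(dataset)
        if len(row) != expected_length
    ]


# Toy data for illustration only; the third row is missing one field.
sample = [
    ['App', 'Category', 'Rating'],
    ['Alpha', 'TOOLS', '4.1'],
    ['Beta', '4.5'],
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

Applied to google_apps_data before the deletion, a check like this would be expected to report the row removed above, provided that the missing information shows up as a missing column.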


The following code checks whether the App Store dataset has any rows whose length differs from that of the header row. Nothing is printed, so every row has the same number of columns as the header and no row is missing a field.

In [3]:
apple_header = apple_apps_data[0]

for row in apple_apps_data:
    if len(row) != len(apple_header):
        print(row)

# Deleting duplicate data

The Google Play Store dataset contains duplicate entries: some apps appear in more than one row. The code below counts the duplicates and prints a sample of the duplicated app names.

In [4]:
duplicate_apps = []
unique_apps = []

for row in google_apps_data[1:]:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Number of unique apps: ', len(unique_apps))
print('\n')
print('Sample of duplicate data: ', duplicate_apps[0:4])
print('\n')
# Data rows (header excluded) minus the duplicate rows
print('Expected length of dataset with duplicates removed: ', len(google_apps_data) - 1 - len(duplicate_apps))

Number of duplicate apps:  1181


Number of unique apps:  9659


Sample of duplicate data:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings']


Expected length of dataset with duplicates removed:  9659
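The loop above tests membership against a list, which scans the list on every check. As a sketch of the same counting logic with a set, where membership tests take roughly constant time, the hypothetical helper `count_duplicates` below runs on a made-up list of names.

```python
def count_duplicates(names):
    """Split names into repeat occurrences and the set of unique names."""
    seen = set()
    duplicates = []
    for name in names:
        if name in seen:
            duplicates.append(name)   # second or later occurrence
        else:
            seen.add(name)            # first occurrence
    return duplicates, seen


# Toy data for illustration only.
sample_names = ['Box', 'Slack', 'Box', 'Zoom', 'Box']
duplicates, unique_names = count_duplicates(sample_names)

print('Number of duplicate apps: ', len(duplicates))
print('Number of unique apps: ', len(unique_names))
```

The two counts always add up to the number of names passed in, which is why the number of unique apps equals the expected length of the deduplicated dataset.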


The duplicate rows will not be removed at random. For each app, the row with the highest number of user reviews is kept and the other rows are deleted. A higher review count suggests that the row was collected more recently, so this keeps the most up-to-date data in our dataset.

The code below builds a dictionary that maps each unique app name to the highest number of user reviews recorded for that app.

In [5]:
reviews_max = {}

for row in google_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
    
print('The length of reviews_max dictionary is: ', len(reviews_max))

The length of reviews_max dictionary is:  9659


The code below walks through the dataset again. A row is appended to android_clean (a list of lists) when its review count equals the maximum recorded for that app in reviews_max and the app's name is not yet in already_added; the name is then added to already_added. The second condition matters when several rows for the same app share the maximum review count: only the first of them is kept. The result is a dataset with one row per app.

In [6]:
android_clean = []
already_added = []

for row in google_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

print('Sample of android_clean list: ', android_clean[0:4])
print('\n')
print('Sample of already_added list: ', already_added[0:5])
print('\n')
print('The length of android_clean list is: ', len(android_clean))

Sample of android_clean list:  [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']]


Sample of already_added list:  ['Photo Editor & Candy Camera & Grid & ScrapBook', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'Sketch - Draw & Paint', 'Pixel Draw - Number Art Coloring Book', 'Paper flowers instruct
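As an alternative to the two passes above, the same rule (keep the row with the most reviews, and the first such row on ties) can be sketched in a single pass by storing whole rows in a dictionary. The helper name `keep_most_reviewed` and its parameters are hypothetical, and the rows are toy data laid out like the Google Play columns, with the name at index 0 and the review count at index 3.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Return one row per app name: the row with the highest review count."""
    best_rows = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # A strict '>' keeps the first row when review counts are tied.
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())


# Toy data for illustration only.
sample_rows = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'TOOLS', '4.5', '50'],
    ['App A', 'TOOLS', '4.0', '200'],
    ['App B', 'TOOLS', '4.5', '50'],
]

for row in keep_most_reviewed(sample_rows):
    print(row)
```

One difference from the notebook's approach: the result here is ordered by each app's first appearance, whereas android_clean is ordered by the position of the kept row. Both contain the same rows.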

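For these datasets, we are only interested in apps whose names are written in English. The function below loops over the characters in a string and uses ord() to get each character's code point. Characters used in English text fall in the ASCII range of 0 to 127, so the function counts the characters above 127 and returns False when it finds more than 3 of them.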

In [7]:
def special_characters(string):
    """Return True if the string has at most 3 non-ASCII characters."""
    number_special_characters = 0
    for character in string:
        # ASCII characters have code points 0-127; count anything above that
        if ord(character) > 127:
            number_special_characters += 1

    return number_special_characters <= 3

print('Is Instagram in English?: ', special_characters('Instagram'))
print('Is 爱奇艺PPS -《欢乐颂2》电视剧热播 in English?: ', special_characters('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print('Is Docs To Go™ Free Office Suite in English?: ', special_characters('Docs To Go™ Free Office Suite'))
#print('Is Instachat 😜 in English?: ', special_characters('Instachat 😜'))


Is Instagram in English?:  True
Is 爱奇艺PPS -《欢乐颂2》电视剧热播 in English?:  False
Is Docs To Go™ Free Office Suite in English?:  True
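To make the results above easier to interpret, the sketch below prints how many characters in each test name fall outside the ASCII range. The helper `count_non_ascii` is a hypothetical name introduced here for illustration; it performs the same count as the loop in special_characters without applying the threshold.

```python
def count_non_ascii(text):
    """Count the characters whose code point is above the ASCII range (0-127)."""
    return sum(1 for character in text if ord(character) > 127)


names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in names:
    print(name, '->', count_non_ascii(name))
```

A single symbol such as ™ or one emoji stays well under the limit of 3, while a name written mostly in Chinese characters exceeds it. The check is a heuristic: an English name with four or more emoji would also be rejected.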


We remove an app only when its name contains more than 3 non-ASCII characters, so that English names with a few symbols or emoji (such as ™) are kept. The code below applies this check to both the Android and Apple datasets and splits each one into 2 lists: English and non-English apps.

In [8]:
english_android_clean = []
non_english_android_clean = []

for row in android_clean:
    name = row[0]
    if not special_characters(name):
        non_english_android_clean.append(row)
    else:
        english_android_clean.append(row)

english_apple = []
non_english_apple = []

for row in apple_apps_data[1:]:
    name = row[1]
    if not special_characters(name):
        non_english_apple.append(row)
    else:
        english_apple.append(row)
      
print('Length of English Android apps list: ', len(english_android_clean))
print('Length of English Apple apps list: ',len(english_apple))
print('Length of non-English Android apps list: ', len(non_english_android_clean))
print('Length of non-English Apple apps list: ',len(non_english_apple))


Length of English Android apps list:  9614
Length of English Apple apps list:  6183
Length of non-English Android apps list:  45
Length of non-English Apple apps list:  1014


Our analysis covers only free apps. The code below keeps the free apps from each dataset and counts the paid ones that are left out. The resulting lists are the final datasets that we will analyze.

In [9]:
apple_apps_final = []
android_apps_final = []

for row in english_apple:
    price = float(row[4])
    if price == 0.0:
        apple_apps_final.append(row)

for row in english_android_clean:
    price = row[7]
    if price == '0':
        android_apps_final.append(row)
        
print('Length of Free Apple apps list: ', len(apple_apps_final))
print('Length of Free Android apps list: ', len(android_apps_final))
print('Length of paid Apple apps: ', len(english_apple) - len(apple_apps_final))
print('Length of paid Android apps: ', len(english_android_clean) - len(android_apps_final))
        

Length of Free Apple apps list:  3222
Length of Free Android apps list:  8864
Length of paid Apple apps:  2961
Length of paid Android apps:  750
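The Android check above compares the Price column with the exact string '0', while the Apple check converts the price to a float first. As a sketch of one check that could serve both datasets, the hypothetical helpers below parse a price string before comparing it. They assume that paid Android prices may carry a leading dollar sign (for example '$4.99'), which the output shown in this notebook does not confirm.

```python
def parse_price(raw_price):
    """Convert a price string such as '0', '0.0', or '$4.99' to a float.

    Assumption: the only non-numeric decoration is an optional leading '$'.
    """
    return float(raw_price.strip().lstrip('$'))


def is_free(raw_price):
    """Return True when the parsed price is zero."""
    return parse_price(raw_price) == 0.0


# Toy values for illustration only.
for raw_price in ['0', '0.0', '$4.99']:
    print(repr(raw_price), '-> free?', is_free(raw_price))
```

With a helper like this, both loops could call is_free(row[4]) or is_free(row[7]) instead of using two different comparisons.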
