# Analysis of free apps in the Google Play store and App Store

* This project analyzes data on free Android (Google Play) and iOS (App Store) apps.
* Goal of the project: to help developers understand which types of apps are likely to attract more users and which will generate the most advertising revenue.

In [None]:
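from csv import reader

# Open both files as UTF-8 so non-ASCII app names are decoded consistently
openapplestorefile = open('AppleStore.csv', encoding='utf8')
opengooglestorefile = open('googleplaystore.csv', encoding='utf8')

# Read each file into a list of rows; row 0 is the header
apple_read_file = reader(openapplestorefile)
apple_apps_data = list(apple_read_file)
openapplestorefile.close()

google_read_file = reader(opengooglestorefile)
google_apps_data = list(google_read_file)
opengooglestorefile.close()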

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(apple_apps_data, 0, 4, rows_and_columns=True)
print('\n')
explore_data(google_apps_data, 0, 4, rows_and_columns=True)
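
The loading steps above can also be wrapped in a small helper. The sketch below is one possible approach, not this notebook's own code: `load_csv` is a hypothetical name, and it assumes the files are UTF-8 encoded and contain at least a header row. The `with` block closes the file automatically once the rows are read.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return (header, rows). Hypothetical helper."""
    # The with block closes the file even if reading raises an error
    with open(path, encoding=encoding, newline='') as f:
        data = list(reader(f))
    # The first row is assumed to be the header
    return data[0], data[1:]
```

With this helper, a call such as `apple_header, apple_rows = load_csv('AppleStore.csv')` would return the header and the data rows separately.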

# Deleting inaccurate data


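One row in the Google Play Store dataset (index 10473, counting the header as row 0) is missing information. The following code deletes that row. Run the cell only once: each run deletes whatever row currently sits at that index.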

In [None]:
print(len(google_apps_data))
del google_apps_data[10473]
print(len(google_apps_data))

The following code checks whether the App Store dataset has any rows whose length differs from that of the header row.

In [None]:
apple_header = apple_apps_data[0]

for row in apple_apps_data:
    if len(row) != len(apple_header):
        print(row)
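
The same length check can report row indexes as well, which avoids hard-coding an index when deleting a bad row. This is a minimal sketch under stated assumptions: `find_malformed_rows` is a hypothetical helper, the first row is assumed to be the header, and the sample rows are made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])  # dataset[0] is assumed to be the header row
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data for illustration only (not taken from either store's dataset)
sample = [
    ['App', 'Category', 'Rating'],
    ['Alpha', 'GAME', '4.5'],
    ['Beta', '4.1'],  # one column short
]

print(find_malformed_rows(sample))  # [(2, ['Beta', '4.1'])]
```

The helper works on any list of rows with a header, so it could be applied to both `apple_apps_data` and `google_apps_data`.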

# Deleting duplicate data

The Google Play Store dataset contains duplicate entries: some apps appear in more than one row. The code below counts the duplicates and prints a sample of the duplicated app names.

In [None]:
duplicate_apps = []
unique_apps = []

for row in google_apps_data[1:]:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('Number of unique apps: ', len(unique_apps))
print('\n')
print('Sample of duplicate data: ', duplicate_apps[0:4])
print('\n')
print('Expected length of dataset with duplicates removed: ', len(google_apps_data[1:]) - len(duplicate_apps))
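
Testing `app_name in unique_apps` scans a list, so the loop above gets slower as the list grows. The sketch below performs the same count with a set, where a membership test takes roughly constant time. `count_duplicates` and the sample rows are assumptions made for illustration.

```python
def count_duplicates(rows, name_index=0):
    """Return (unique_names, duplicate_names) for a list of rows."""
    seen = set()
    duplicates = []
    for row in rows:
        name = row[name_index]
        if name in seen:
            duplicates.append(name)  # every repeat occurrence is recorded
        else:
            seen.add(name)
    return seen, duplicates

# Toy rows for illustration only
rows = [['Alpha', '10'], ['Beta', '5'], ['Alpha', '12'], ['Alpha', '12']]
unique_names, duplicate_names = count_duplicates(rows)

print(len(unique_names))                  # 2
print(duplicate_names)                    # ['Alpha', 'Alpha']
print(len(rows) - len(duplicate_names))   # 2
```

The last line shows that the expected length after removing duplicates is the number of rows minus the number of repeat occurrences, which equals the number of unique names.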

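Duplicates will not be removed at random. For each app, the row with the highest number of user reviews is kept and the remaining rows for that app are deleted. Because review counts generally grow over time, the row with the most reviews is likely the most recently collected one, so this keeps the most up-to-date data in our dataset.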

The code below creates a dictionary that maps each unique app name to the highest number of user reviews recorded for that app.

In [None]:
reviews_max = {}

for row in google_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
    
print('The length of reviews_max dictionary is: ', len(reviews_max))

The code below finds, for each app, the row with the highest number of reviews. That entire row is appended to the android_clean list, producing a list of lists, and the app's name is appended to already_added. The check against already_added ensures that only one row is kept when several duplicates share the same highest review count. This eliminates duplicate data from our dataset.

In [None]:
android_clean = []
already_added = []

for row in google_apps_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

print('Sample of android_clean list: ', android_clean[0:4])
print('\n')
print('Sample of already_added list: ', already_added[0:5])
print('\n')
print('The length of android_clean list is: ', len(android_clean))
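
The two passes above (build the dictionary of maximum review counts, then filter the rows) can also be combined into one pass by storing the best row per app directly. This is a sketch of an alternative, not the notebook's method: `keep_most_reviewed`, its parameters, and the sample rows are assumptions. It relies on dictionaries preserving insertion order (Python 3.7+), so the output follows the order in which each app name first appears.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict > keeps the first row seen when review counts tie
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows for illustration only: name, category, rating, reviews
rows = [
    ['Alpha', 'GAME', '4.5', '10'],
    ['Beta', 'TOOLS', '4.0', '5'],
    ['Alpha', 'GAME', '4.5', '12'],
    ['Alpha', 'GAME', '4.6', '12'],
]

for row in keep_most_reviewed(rows):
    print(row)
```

For the sample rows, the result contains one Alpha row (the first one with 12 reviews) and the single Beta row.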

For this analysis, we are only interested in apps whose names are written in English. In the code below, we loop over the characters of a name and use ord() to check whether each one falls within the ASCII range (code points 0 to 127), which covers the characters commonly used in English text.

In [22]:
def special_characters(string):
    # Return False as soon as a non-ASCII character is found
    for character in string:
        if ord(character) > 127:
            return False
    # Every character was ASCII (code point 0 to 127)
    return True

print('Is Instagram in English?: ', special_characters('Instagram'))
print('Is 爱奇艺PPS -《欢乐颂2》电视剧热播 in English?: ', special_characters('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print('Is Docs To Go™ Free Office Suite in English?: ', special_characters('Docs To Go™ Free Office Suite'))
#print('Is Instachat 😜 in English?: ', special_characters('Instachat 😜'))


Is Instagram in English?:  True
Is 爱奇艺PPS -《欢乐颂2》电视剧热播 in English?:  False
Is Docs To Go™ Free Office Suite in English?:  False
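
A check that every character in a name is ASCII can be written compactly with `all()`. The sketch below uses a hypothetical name, `is_ascii_only`, and also prints the result of the built-in `str.isascii()` method (available from Python 3.7), which performs the same test.

```python
def is_ascii_only(string):
    """Return True only if every character has a code point of 127 or below."""
    return all(ord(character) <= 127 for character in string)

for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    # str.isascii() (Python 3.7+) gives the same answer as the helper
    print(name, is_ascii_only(name), name.isascii())
```

Only 'Instagram' passes this check. A strict ASCII test also rejects names that are otherwise English but contain a symbol such as ™ or an emoji.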
