# Types of Apps with the Most Users

I am analyzing data from the Apple Store and the Google Play store to learn which kinds of apps users are most likely to download.  For this project I am imagining that I work for a company that makes free apps and gets its revenue from ads.  I hope to learn which types of apps attract the most users, and therefore which would make the most sense for this company to develop.

I'm going to use the Apple Store data set, which can be found [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps#AppleStore.csv), and the Google Play store data set, which can be found [here](https://www.kaggle.com/lava18/google-play-store-apps).

The first thing I will do is open both files and convert them to lists of rows that I can work with.

In [11]:
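from csv import reader

# Open each file with an explicit encoding so that app names containing
# non-ASCII characters are read the same way on every platform.
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
apple_data = list(read_file)  # list of rows; row 0 is the header
opened_file.close()

opened_file2 = open('googleplaystore.csv', encoding='utf8')
read_file2 = reader(opened_file2)
google_data = list(read_file2)  # list of rows; row 0 is the header
opened_file2.close()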
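
Since the same steps are repeated for each file, they could be wrapped in a small helper. The sketch below is one way to do that, not the code used in the rest of this notebook: `load_csv` is a hypothetical name, and the `utf8` default is an assumption about how the files were saved. The `with` block closes the file automatically once the rows have been read.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`."""
    # newline='' is what the csv module documentation recommends
    # when opening a file that will be passed to csv.reader.
    with open(path, encoding=encoding, newline='') as f:
        data = list(reader(f))
    return data[0], data[1:]
```

A call such as `google_header, google_rows = load_csv('googleplaystore.csv')` would keep the header separate from the data rows, which removes the need to slice with `[1:]` later on.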


# Cleaning the Data
I'm going to start by cleaning up the data.
From the [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) about the Google data, I see that one specific row is missing a piece of data, and I will want to delete it if that's the case.  To double-check, I'll print it below.  The discussion gives the index as 10472, but I use 10473 because I did not remove the header row from my data set.

In [12]:
google_data[10473]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

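The row has only 12 fields where a complete row has 13: the category is missing, so every later value is shifted one column to the left (the rating column holds '19', for example).  So now I will delete this row and then print the new row at the same index to make sure it was deleted: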

In [13]:
del google_data[10473]
google_data[10473]

['osmino Wi-Fi: free WiFi',
 'TOOLS',
 '4.2',
 '134203',
 '4.1M',
 '10,000,000+',
 'Free',
 '0',
 'Everyone',
 'Tools',
 'August 7, 2018',
 '6.06.14',
 '4.4 and up']
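
Rather than relying on the discussion thread alone, a row like the deleted one can also be found programmatically by comparing each row's length with the header's. This is a minimal sketch on made-up rows; `find_short_rows` is a hypothetical helper, and it assumes the header is the first row of the list.

```python
def find_short_rows(data):
    """Return the indices of rows whose length differs from the header's."""
    expected = len(data[0])  # assumes data[0] is the header row
    return [i for i, row in enumerate(data) if len(row) != expected]

# Toy data for illustration only, not taken from the real file.
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.2'],
    ['Broken App', '1.9'],  # category missing, so one field short
]
print(find_short_rows(sample))  # prints [2]
```

Running `find_short_rows(google_data)` before the deletion should flag index 10473, and would also reveal any other malformed rows if they exist.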

There are known duplicate entries in the Google data set, so that is another thing I need to fix while cleaning up the data.  First I will check how many duplicates there are:

In [15]:
duplicate_apps = []
unique_apps = []

for app in google_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of Duplicate Apps', len(duplicate_apps))
print('\n')
print('Examples of Duplicate Apps', duplicate_apps[:10])

Number of Duplicate Apps 1181


Examples of Duplicate Apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
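
As an aside on the counting loop: checking `name in unique_apps` scans a list on every iteration, which gets slower as the list grows. A sketch of an alternative using `collections.Counter` from the standard library is below. `count_duplicates` and `name_index` are hypothetical names, and the count follows the same convention as the loop, where every occurrence after the first counts as one duplicate.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Count the extra rows, i.e. occurrences beyond the first of each name."""
    counts = Counter(row[name_index] for row in rows)
    return sum(n - 1 for n in counts.values())

# Toy rows for illustration only.
sample = [['Slack'], ['Box'], ['Slack'], ['Slack'], ['Zenefits']]
print(count_duplicates(sample))  # prints 2 (the two extra 'Slack' rows)
```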


So there are 1,181 duplicates in the data set.  I'm going to look at one of the examples to try to get an idea of which entries to keep and which to delete.

In [18]:
for app in google_data:
    name = app[0]
    if name == 'Slack':
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


The only difference between the entries appears to be the fourth field (51507, 51507, and 51510), which is the number of reviews the app has received.  A higher review count suggests that the entry was collected more recently, and I would like to compare apps using the most recent data, so I will keep the entry with the highest number of reviews and delete the others.

First I'm going to figure out how many unique apps there are, so I know how many apps my clean data set should contain (from the code above I learned that there are 1,181 duplicate apps):

In [24]:
print('Expected length of clean data set:', len(google_data[1:]) - 1181)

Expected length of clean data set: 9659


So I expect my clean data set to have 9,659 entries.  I will now create a dictionary where the keys are the names of the apps and the values are the highest number of reviews.  Each app name will be added to the dictionary as the code loops through the data set and the number of reviews will get updated to the highest number.  At the end I will check the length of my dictionary to make sure it is 9,659 like I expect.

In [23]:
reviews_max = {}

for app in google_data[1:]:  # skip the header row
    name = app[0]
    n_reviews = float(app[3])
    # Keep the largest review count seen so far for each app name.
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Length of Dictionary:', len(reviews_max))

Length of Dictionary: 9659
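
The update rule can also be written with `dict.get` and `max`, which makes the "keep the largest value seen so far" intent explicit. This is a sketch on toy rows rather than the notebook's own code; `max_reviews_by_name` is a hypothetical helper, and the column positions are passed in as parameters instead of being hard-coded.

```python
def max_reviews_by_name(rows, name_index=0, reviews_index=3):
    """Map each app name to the largest review count among its rows."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # get() falls back to n_reviews the first time a name is seen.
        best[name] = max(best.get(name, n_reviews), n_reviews)
    return best

# Toy rows shaped like the Slack example: only index 3 differs.
sample = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Other App', 'TOOLS', '4.0', '100'],
]
print(max_reviews_by_name(sample))
```

Note that checking the dictionary's length only confirms the number of unique names; printing one known duplicate, such as `reviews_max['Slack']`, is a quick way to confirm that the stored value is really the highest count.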


The dictionary length is what I expected, and I now have a dictionary that maps each unique app name to its highest number of reviews.  I'm now going to use this dictionary to loop through the original data set and create a cleaned one, with the duplicate entries removed and only the entries with the highest number of reviews remaining.  Some of the duplicate entries have exactly the same number of reviews, but I still want only one entry for each app.  For that reason, while looping through the data set I will also build a list of the names that have already been added.  That way an entry is added only if it has the highest number of reviews AND its app is not already in the clean data set.

When I'm done I will check the length of my clean data to make sure it is 9,659 like I expect.

In [25]:
google_data_clean = []
already_added = []

for app in google_data[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        google_data_clean.append(app)
        already_added.append(name)

print('Length of Clean Data Set:', len(google_data_clean))

Length of Clean Data Set: 9659
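
The two passes (build `reviews_max`, then filter) can be folded into one by storing the best row per name instead of only the best count. A sketch under that assumption is below; `dedupe_keep_max` is a hypothetical helper. The result is ordered by each name's first appearance (dicts preserve insertion order in Python 3.7+), which can differ from the order produced by the two-pass loop.

```python
def dedupe_keep_max(rows, name_index=0, reviews_index=3):
    """Keep one row per name: the first row with the highest review count."""
    best_row = {}
    for row in rows:
        name = row[name_index]
        kept = best_row.get(name)
        # Strict > means that on a tie the row seen first is kept.
        if kept is None or float(row[reviews_index]) > float(kept[reviews_index]):
            best_row[name] = row
    return list(best_row.values())

# Toy rows for illustration only.
sample = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Other App', 'TOOLS', '4.0', '100'],
]
print(len(dedupe_keep_max(sample)))  # prints 2
```

Because the dictionary lookup replaces the `name not in already_added` list scan, this version also avoids the slowdown that comes with checking membership in a growing list.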


I now have a new clean data set that contains only unique apps.

For this analysis I am only interested in English-language apps.  I'm going to write a function that tells whether an app name is English or not by counting the characters that fall outside the ASCII range (code points above 127).  Because some English app names contain symbols or emojis, I'm going to allow names with up to 3 such characters.  This is not a perfect method but should be good enough for my analysis.  I will test my function on a few app names that I already know are English or not English and see if it returns the correct values.


In [33]:
def is_english(string):
    non_english = []
    for character in string:
        if ord(character) > 127:
            non_english.append(character)
            if len(non_english) > 3:
                return False
    
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Instachat 😜'))
print(is_english('Docs To Go™ Free Office Suite'))

True
False
True
True
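
As a variation on `is_english`, the same check can be written as a count with the threshold exposed as a parameter, which makes it easy to try a stricter or looser cutoff. This is only a sketch: `is_mostly_ascii` and `max_non_ascii` are hypothetical names. The name is deliberate, because the test really measures "mostly ASCII" rather than "English": names in other Latin-script languages can pass, and an English name with more than three emojis fails.

```python
def is_mostly_ascii(text, max_non_ascii=3):
    """Return True if `text` has at most `max_non_ascii` characters above code point 127."""
    non_ascii = sum(1 for ch in text if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_mostly_ascii('Instagram'))                      # True
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))      # False
print(is_mostly_ascii('Instachat 😜'))                    # True
print(is_mostly_ascii('Docs To Go™ Free Office Suite'))  # True
```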


My function worked like I expected!  I will now use this function to loop through both the Google and Apple data and create two new clean data sets to work with.  I will then measure the length of each data set to see how many entries are left.


In [35]:
google_data_clean_eng = []
apple_data_eng = []

for app in google_data_clean:
    name = app[0]
    if is_english(name):
        google_data_clean_eng.append(app)
        
for app in apple_data[1:]:
    name = app[0]
    if is_english(name):
        apple_data_eng.append(app)

print(len(google_data_clean_eng))
print(len(apple_data_eng))

9614
7197
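
Both loops read the app name from index 0. That matches the Google rows printed earlier, but it is worth confirming for the Apple file, since its column layout is never printed in this notebook. One quick check: if the Apple count equals `len(apple_data[1:])`, no row was filtered out, which would suggest that index 0 holds something other than the name (an ID, for instance). The sketch below looks the column up by its header label instead of hard-coding a position. The label `track_name` is an assumption that should be checked against `apple_data[0]`, and `column_index` is a hypothetical helper.

```python
def column_index(header, column_name):
    """Return the position of `column_name` in the header row."""
    return header.index(column_name)  # raises ValueError if the label is absent

# Toy header and rows for illustration only; the real labels should be
# checked with print(apple_data[0]). 'track_name' is an assumption.
header = ['id', 'track_name', 'size_bytes']
rows = [
    ['1', 'Instagram', '100'],
    ['2', '爱奇艺PPS -《欢乐颂2》电视剧热播', '200'],
]

name_idx = column_index(header, 'track_name')
names = [row[name_idx] for row in rows]
print(names)
```

With the index found this way, the filter loop would read `name = app[name_idx]`, and the resulting count could then be compared with the unfiltered length to confirm that some names were actually removed.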
