First, we'll create two objects, one for each dataset.

In [7]:
# Dataset class
class AndroidDataset():
    
    def __init__(self, row):
        self.app      = row[0]
        self.category   = row[1]
        self.rating    = row[2]
        self.reviews = row[3]
        self.size = row[4]
        self.installs = row[5]
        self.type = row[6]
        self.price = row[7]
        self.content_rating = row[8]
        self.genres = row[9]
        self.last_updated = row[10]
        self.current_ver = row[11]
        # self.android_ver = row[12]

# Read the dataset and create App instances
android = []
import csv
with open('googleplaystore.csv', encoding='utf8') as f:
    reader = csv.reader(f)
    rows = list(reader)
    for row in rows[1:]:
        android_app = AndroidDataset(row)
        android.append(android_app)

# Solution testing
first = android[0]
print(first.app)
print(first.category)
print(first.rating)
print(first.reviews)
print(first.current_ver)

# Number of rows
print('Number of rows:', len(android))

Photo Editor & Candy Camera & Grid & ScrapBook
ART_AND_DESIGN
4.1
159
1.0.0
Number of rows: 10841


In [14]:
# Dataset class
class IosDataset():
    
    def __init__(self, row):
        self.id      = row[0]
        self.track_name   = row[1]
        self.size_bytes    = row[2]
        self.currency = row[3]
        self.price = row[4]
        self.rating_count_tot = row[5]
        self.rating_count_ver = row[6]
        self.user_rating = row[7]
        self.user_rating_ver = row[8]
        self.ver = row[9]
        self.cont_rating = row[10]
        self.prime_genre = row[11]
        #self.sup_devices.num = row[12]
        #self.ipadSc_urls.num = row[13]
        #self.lang.num = row[14]
        #self.vpp_lic = row[15]

# Read the dataset and create App instances
ios = []
import csv
with open('AppleStore.csv', encoding='utf8') as f:
    reader = csv.reader(f)
    rows = list(reader)
    for row in rows[1:]:
        ios_app = IosDataset(row)
        ios.append(ios_app)

# Solution testing
first = ios[0]
print(first.id)
print(first.track_name)
print(first.size_bytes)
print(first.currency)
print(first.price)
# Number of rows
print('Number of rows:', len(ios))

284882215
Facebook
389879808
USD
0.0
Number of rows: 7197


The Google Play dataset has a dedicated <a href="https://www.kaggle.com/lava18/google-play-store-apps/discussion" target="_blank">discussion section</a>, and we can see that one of the discussions describes an error for <a href="https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015" target="_blank">row 10472</a>. In this row, there's no category for the app, which leads to a column missing. Because of that, we'll delete this row.

In [19]:
print(len(android))
#del(android[10472])
print(len(android))

10840
10840


There's also some apps with more than one entry in the dataset, so we'll need to remove the duplicates in those situations. We can confirm this information in the code below:

In [47]:
duplicate_apps = []
unique_apps = []
i = 0

for app in android:
    app_object = android[i]
    name = app_object.app
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
    i += 1
    
print('Number of duplicate apps:', len(duplicate_apps))

Number of duplicate apps: 1181


In [58]:
j = 0

for app in android:
    app_object = android[j]
    name = app_object.app
    if name == 'Instagram':
        print(app_object.reviews)
    j += 1

66577313
66577446
66577313
66509917


In the code above, we can see that the number of reviews are different: 66577313, 66577446, 66577313 and 66509917. The higher the number of reviews, the more recent the data should be, so we'll try and collect the more recent ones.

To remove the duplicates, we will do the following:

* Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.
* Use the information stored in the dictionary and create a new dataset, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).

In [61]:
reviews_max = {}
j = 0

for app in android:
    app_object = android[j]
    name = app_object.app
    n_reviews = float(app_object.reviews)
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
    j += 1

print('Expected length, without the duplicates:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length, without the duplicates: 9659
Actual length: 9659


Now, we'll use the dictionary created above to remove the duplicate rows.

In [64]:
android_clean = []
already_added = []
j = 0

for app in android:
    app_object = android[j]
    name = app_object.app
    n_reviews = float(app_object.reviews)
    
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
    
    j += 1

print('Our new dataset has', len(android_clean), 'rows.')

Our new dataset has 9659 rows.


In the previous step, we managed to remove the duplicate app entries in the Google Play dataset. Remember we use English for the apps we develop at our company, and we'd like to analyze only the apps that are designed for an English-speaking audience. However, if we explore the data long enough, we'll find that both datasets have apps with names that suggest they are not designed for an English-speaking audience. We're not interested in keeping these apps, so we'll remove them. We can do that by using the ASCII system, in which the characters we commonly use in an English text are all in the range 0 to 127.

In [90]:
def string_english(string):
    aux = 0
    for char in string:
        if ord(char) > 127:
            aux +=1
    if aux > 3:
        return False
    else:
        return True
        
#Checking if the function is working properly
print(string_english('Instagram'))
print(string_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(string_english('Docs To Go™ Free Office Suite'))
print(string_english('Instachat 😜'))



True
False
True
True


We can now use the function *string_english* to filter out non-English apps from both datasets.

In [97]:
android_english = []
ios_english = []
j = 0
k = 0

for app in android_clean:
    app_object = android[j]
    name = app_object.app
    
    if string_english(name):
        android_english.append(app)
    
    j += 1

for app in ios:
    app_object = ios[k]
    name = app_object.track_name
    
    if string_english(name):
        ios_english.append(app)
    
    k += 1

print('Number of rows:', len(android_english))
print('Number of rows:', len(ios_english))

Number of rows: 9620
Number of rows: 6183


As the last step of our data cleaning process, we'll isolate the free apps.

In [99]:
android_free = []
ios_free = []
j = 0
k = 0

for app in android_english:
    app_object = android[j]
    type_app = app_object.type
    
    if type_app == 'Free':
        android_free.append(app)
    
    j += 1

for app in ios_english:
    price = app.price
    
    if price == '0.0':
        ios_free.append(app)
    
    k += 1

print(len(android_free))
print(len(ios_free))

8902
3222


We now have 8902 apps in Google Play and 3222 apps in App Store that we can use in our analysis.

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Our end goal is to add the app on both the App Store and Google Play, so we need to find app profiles that are successful on both markets. 

Let's begin the analysis by getting a sense of the most common genres for each market. We'll do that by creating an object for frequency tables.

In [None]:
class FreqTable():
    def __init__(self):
        self.count = {}
    
    def add(self, element):
        if not element in self.count:
            self.count[element] = 0
        self.count[element] += 1
        return self.count[element]
    
    def get_count(self, element):
        if element not in self.count:
            return 0
        return self.count[element]
    
freq_table = FreqTable()